<?xml version="1.0" encoding="utf-8"?><?xml-stylesheet type="text/xsl" href="rss.xsl"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/">
    <channel>
        <title>recode hive Blog</title>
        <link>https://www.recodehive.com/blog</link>
        <description>recode hive Blog</description>
        <lastBuildDate>Tue, 19 May 2026 00:00:00 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>https://github.com/jpmonette/feed</generator>
        <language>en</language>
        <item>
            <title><![CDATA[Why Data Engineers Make Better Business Analysts Than MBAs Do]]></title>
            <link>https://www.recodehive.com/blog/data-engineers-vs-mbas</link>
            <guid>https://www.recodehive.com/blog/data-engineers-vs-mbas</guid>
            <pubDate>Tue, 19 May 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[MBAs learn to analyze businesses. Data engineers live inside the data those businesses actually run on. Here's why that difference matters more than any credential — and what both sides should learn from each other.]]></description>
            <content:encoded><![CDATA[<p>The VP of Marketing walked into the quarterly business review with a slide deck. Forty-three slides. The headline on slide 7 read: <strong>"Customer acquisition cost down 18% QoQ."</strong></p>
<p>The data engineer sitting in the back of the room knew something was wrong before the slide finished loading.</p>
<p>Three weeks earlier, she had noticed a JOIN condition in the CAC calculation pipeline that was double-counting leads from the new referral program. She had filed a ticket. The ticket was still open. The number on that slide — the one the VP was presenting to the CEO, the one that would inform next quarter's $2M budget allocation — was wrong by a factor that would embarrass everyone in the room once someone finally ran the corrected query.</p>
<p>She raised her hand.</p>
<p>That moment, the data engineer who knows the data is wrong before the analyst finishes presenting it, is not an accident. It is the structural consequence of a difference in how MBAs and data engineers relate to business data. One group studies it. The other one builds the systems that produce it.</p>
<p><strong>What this post argues:</strong></p>
<ul>
<li class="">Why proximity to data systems is a more durable analytical advantage than business frameworks</li>
<li class="">The five specific skills data engineers have that MBAs typically don't — and why each one matters for business analysis</li>
<li class="">Where MBAs still have a genuine edge (and data engineers should stop pretending otherwise)</li>
<li class="">What the best business analysts of the next decade will look like — and why neither camp gets there alone</li>
</ul>
<p>This is going to step on some toes. That is intentional.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="first-a-fair-definition-of-terms">First, a Fair Definition of Terms<a href="https://www.recodehive.com/blog/data-engineers-vs-mbas#first-a-fair-definition-of-terms" class="hash-link" aria-label="Direct link to First, a Fair Definition of Terms" title="Direct link to First, a Fair Definition of Terms" translate="no">​</a></h2>
<p>Before making the argument, it is worth being precise about what is actually being compared.</p>
<p><strong>MBA</strong> here means someone trained in the traditional business school tradition: frameworks for strategy (Porter's Five Forces, BCG Matrix), finance (DCF, unit economics), and organizational behavior. They are taught to analyze businesses from the outside, to take a set of numbers, apply a framework, and produce a recommendation.</p>
<p><strong>Data engineer</strong> here means someone who designs, builds, and operates the systems that collect, transform, store, and serve data. They spend their days inside pipelines, schemas, and query plans in direct contact with how data is actually produced, not just how it eventually appears in a report.</p>
<p><strong>Business analyst</strong> here means the role that sits between raw data and business decisions: translating what the data says into what the business should do.</p>
<p>The argument is not that MBAs are bad at analysis in general. It is that data engineers have a structural advantage specifically in the business analyst role, because that role increasingly depends on understanding the data infrastructure underneath the numbers — not just the numbers themselves.</p>
<p><img decoding="async" loading="lazy" alt="&amp;quot;Two Vantage Points&amp;quot; diagram" src="https://www.recodehive.com/assets/images/two-vantage-points-b6cafc32390ca79cadff9401863d97a5.png" width="1448" height="1086" class="img_ev3q"></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="reason-1-data-engineers-know-when-the-number-is-wrong">Reason #1: Data Engineers Know When the Number Is Wrong<a href="https://www.recodehive.com/blog/data-engineers-vs-mbas#reason-1-data-engineers-know-when-the-number-is-wrong" class="hash-link" aria-label="Direct link to Reason #1: Data Engineers Know When the Number Is Wrong" title="Direct link to Reason #1: Data Engineers Know When the Number Is Wrong" translate="no">​</a></h2>
<p>This is the most important one, and it is the one that is hardest to teach.</p>
<p>Every metric in a business, be it revenue, churn, CAC, LTV, retention, is produced by a pipeline. That pipeline has JOIN conditions, aggregation logic, filter predicates, and data source assumptions baked into it. When the pipeline has a bug, or when upstream data quality degrades, or when the definition of a metric silently changes because someone modified a dbt model, the number in the dashboard changes too.</p>
<p>An MBA looking at that number sees: a trend. A data engineer who built or maintains that pipeline sees: the JOIN that changed last Tuesday, the source table that started receiving nulls on day 14 of last month, the filter that was added to "clean up outliers" that accidentally excluded an entire customer segment.</p>
<p>This is not a hypothetical. It happens constantly, in every organization that runs on data, at every scale. The question is not whether there are errors in your metrics, there are. The question is who in the room knows about them before the decision gets made.</p>
<div class="theme-admonition theme-admonition-note admonition_xJq3 alert alert--secondary"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M6.3 5.69a.942.942 0 0 1-.28-.7c0-.28.09-.52.28-.7.19-.18.42-.28.7-.28.28 0 .52.09.7.28.18.19.28.42.28.7 0 .28-.09.52-.28.7a1 1 0 0 1-.7.3c-.28 0-.52-.11-.7-.3zM8 7.99c-.02-.25-.11-.48-.31-.69-.2-.19-.42-.3-.69-.31H6c-.27.02-.48.13-.69.31-.2.2-.3.44-.31.69h1v3c.02.27.11.5.31.69.2.2.42.31.69.31h1c.27 0 .48-.11.69-.31.2-.19.3-.42.31-.69H8V7.98v.01zM7 2.3c-3.14 0-5.7 2.54-5.7 5.68 0 3.14 2.56 5.7 5.7 5.7s5.7-2.55 5.7-5.7c0-3.15-2.56-5.69-5.7-5.69v.01zM7 .98c3.86 0 7 3.14 7 7s-3.14 7-7 7-7-3.12-7-7 3.14-7 7-7z"></path></svg></span>note</div><div class="admonitionContent_BuS1"><p>A 2023 survey by Monte Carlo Data found that data engineers spend an average of <strong>40% of their time</strong> on data quality issues, finding them, diagnosing them, and fixing them. That is not a cost center. That is 40% of someone's professional life spent developing an intimate understanding of where and how business data breaks.</p></div></div>
<p>The MBA in that quarterly business review learned Porter's Five Forces. The data engineer learned that the CAC pipeline double-counts referral leads. Both are forms of knowledge. Only one of them catches the error before the budget gets misallocated.</p>
<p><img decoding="async" loading="lazy" alt="&amp;quot;Where the 40% goes&amp;quot;" src="https://www.recodehive.com/assets/images/data-engineer-time-split-d4ff92556b629d65648478c1dc480e40.png" width="1666" height="944" class="img_ev3q"></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="reason-2-they-understand-the-difference-between-what-data-says-and-what-data-means">Reason #2: They Understand the Difference Between What Data Says and What Data Means<a href="https://www.recodehive.com/blog/data-engineers-vs-mbas#reason-2-they-understand-the-difference-between-what-data-says-and-what-data-means" class="hash-link" aria-label="Direct link to Reason #2: They Understand the Difference Between What Data Says and What Data Means" title="Direct link to Reason #2: They Understand the Difference Between What Data Says and What Data Means" translate="no">​</a></h2>
<p>Here is a question that sounds simple and is actually hard: <strong>Is a spike in daily active users good news?</strong></p>
<p>The MBA answers: yes, obviously. Growth is good.</p>
<p>The data engineer asks three questions before answering anything:</p>
<ol>
<li class="">Did the event tracking code change recently? (A new <code>screen_view</code> event being fired twice could double DAU artificially.)</li>
<li class="">Did the definition of "active" change in the metrics layer?</li>
<li class="">Is this spike uniform across platforms, or is it isolated to one app version that might have a tracking bug?</li>
</ol>
<p>This is not paranoia. This is pattern recognition earned by having debugged dozens of "spikes" that turned out to be instrumentation errors, schema migrations, or upstream data source changes. Data engineers develop a strong prior that anomalies in data are more likely to be measurement errors than real business events because, in their experience, that is usually true.</p>
<p>Business analysis requires exactly this skepticism. The job is not to report what the number says. It is to assess whether the number is trustworthy, what it actually measures, and what legitimate conclusions can be drawn from it. Data engineers are trained for this by doing it wrong enough times that it becomes instinct.</p>
<p><strong>A real pattern, seen repeatedly across teams:</strong></p>
<table><thead><tr><th>Scenario</th><th>MBA interpretation</th><th>Data engineer's first question</th></tr></thead><tbody><tr><td>Revenue up 12% MoM</td><td>Strong growth signal</td><td>Did the billing pipeline change?</td></tr><tr><td>Churn down 3%</td><td>Retention improving</td><td>Was the churn definition updated?</td></tr><tr><td>Support tickets up 40%</td><td>Product quality issue</td><td>Did we change the ticket tagging logic?</td></tr><tr><td>Page load time improved</td><td>Engineering win</td><td>Is the new monitoring missing slow requests?</td></tr></tbody></table>
<p><img decoding="async" loading="lazy" alt="&amp;quot;Same Spike, Two Reactions&amp;quot;" src="https://www.recodehive.com/assets/images/same-spike-two-reactions-96685bcb238e19380c04112337e3336d.png" width="1536" height="1024" class="img_ev3q"></p>
<p>The MBA is not wrong to interpret those signals. The data engineer is right to interrogate them first.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="reason-3-they-think-in-systems-not-snapshots">Reason #3: They Think in Systems, Not Snapshots<a href="https://www.recodehive.com/blog/data-engineers-vs-mbas#reason-3-they-think-in-systems-not-snapshots" class="hash-link" aria-label="Direct link to Reason #3: They Think in Systems, Not Snapshots" title="Direct link to Reason #3: They Think in Systems, Not Snapshots" translate="no">​</a></h2>
<p>MBA training is heavily oriented around snapshots: a financial model at a point in time, a competitive analysis as of this quarter, a market sizing exercise based on current data. The analytical unit is the report.</p>
<p>Data engineering is, fundamentally, about systems that produce data continuously over time. The analytical unit is the pipeline, a thing that runs repeatedly, handles changing inputs, breaks in specific ways under specific conditions, and accumulates state.</p>
<p>This shapes how you think about business problems in ways that matter.</p>
<p>When an MBA sees declining retention, they reach for a segmentation analysis: which cohort is churning, what do those users have in common, what intervention addresses the segment. This is useful analysis.</p>
<p>When a data engineer sees declining retention, they also ask: is the retention calculation correct, is it consistent across cohorts, are we measuring the same thing for users who signed up six months ago as for users who signed up last week, did the product change in a way that makes the old metric definition no longer comparable?</p>
<p>The MBA is doing cross-sectional analysis. The data engineer is doing longitudinal systems thinking — asking whether the measurement is stable across time, not just whether the trend is meaningful within one period.</p>
<p>This difference shows up in business analysis as the gap between finding a pattern and understanding whether the pattern is real.</p>
<div class="theme-admonition theme-admonition-info admonition_xJq3 alert alert--info"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M7 2.3c3.14 0 5.7 2.56 5.7 5.7s-2.56 5.7-5.7 5.7A5.71 5.71 0 0 1 1.3 8c0-3.14 2.56-5.7 5.7-5.7zM7 1C3.14 1 0 4.14 0 8s3.14 7 7 7 7-3.14 7-7-3.14-7-7-7zm1 3H6v5h2V4zm0 6H6v2h2v-2z"></path></svg></span>info</div><div class="admonitionContent_BuS1"><p><strong>The classic example:</strong> A SaaS company sees retention improving in their cohort analysis. The data engineer checks whether the cohort definition changed. It did — the team quietly started excluding users who never completed onboarding from the retention denominator. Retention "improved" because the measurement changed, not because users stopped churning. The MBA writes a memo about the success of the new onboarding flow. The data engineer spots the denominator change in a Git commit.</p></div></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="reason-4-they-know-what-it-costs-to-answer-a-question">Reason #4: They Know What It Costs to Answer a Question<a href="https://www.recodehive.com/blog/data-engineers-vs-mbas#reason-4-they-know-what-it-costs-to-answer-a-question" class="hash-link" aria-label="Direct link to Reason #4: They Know What It Costs to Answer a Question" title="Direct link to Reason #4: They Know What It Costs to Answer a Question" translate="no">​</a></h2>
<p>Here is something MBA programs do not teach: some business questions are expensive to answer, and the cost of answering them should factor into whether you ask them.</p>
<p>Data engineers know this instinctively. They have seen a well-meaning analyst write a query that full-scanned a 10TB table to answer a question that could have been answered with a 50MB aggregate. They have been paged at 2am because a dashboard query took down a production database. They have estimated the engineering cost of building the data infrastructure required to answer a question that turned out not to need answering.</p>
<p>This changes how you frame business analysis questions. When a data engineer considers a business question, they are simultaneously asking:</p>
<ul>
<li class="">
<ol>
<li class="">Is this data available, or does it need to be collected?</li>
</ol>
</li>
<li class="">
<ol start="2">
<li class="">If collected, how clean is it, and what cleaning effort is required to use it?</li>
</ol>
</li>
<li class="">
<ol start="3">
<li class="">What is the query cost of answering this at the granularity the question implies?</li>
</ol>
</li>
<li class="">
<ol start="4">
<li class="">Is the question answerable at all with the data that exists, or is it being asked in a form that sounds precise but cannot be operationalized?</li>
</ol>
</li>
</ul>
<p>The last one is underrated. A lot of business questions, stated precisely, cannot be answered with existing data. "What is our true customer lifetime value?" sounds like a concrete question. A data engineer knows that answering it requires solving a customer identity resolution problem, a revenue attribution problem, and a survivorship bias problem before the math even starts and that the data to solve all three may not exist in a form that supports the precision implied by the question.</p>
<p>An MBA will build a DCF model around an LTV number. A data engineer will ask how the LTV was calculated and whether the denominator includes the customers who churned before their first purchase. These are not the same conversation.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="reason-5-they-have-built-things-that-failed-in-production">Reason #5: They Have Built Things That Failed in Production<a href="https://www.recodehive.com/blog/data-engineers-vs-mbas#reason-5-they-have-built-things-that-failed-in-production" class="hash-link" aria-label="Direct link to Reason #5: They Have Built Things That Failed in Production" title="Direct link to Reason #5: They Have Built Things That Failed in Production" translate="no">​</a></h2>
<p>There is a specific kind of knowledge that only comes from building systems that fail in production: the knowledge that the gap between how something is supposed to work and how it actually works is almost always larger than you expect.</p>
<p>Data engineers live in that gap. A pipeline that processes customer events correctly in staging fails in production when the API starts sending Unicode characters in a field that was always ASCII in the test environment. A join that works perfectly on last month's data produces duplicates on this month's data because an upstream system changed its primary key generation logic. A metric that was correct for two years becomes incorrect when the product introduces a new pricing tier that the original calculation logic never anticipated.</p>
<p>The cumulative effect of building, breaking, debugging, and fixing data systems is a deeply skeptical relationship with any number that comes out of a system you didn't build yourself. This skepticism is not cynicism. It is calibration.</p>
<p>Business analysis depends on calibrated skepticism about data. The analyst who trusts every number in the dashboard is going to make bad recommendations. The analyst who knows, from experience, that dashboards lie in specific ways and in predictable circumstances is going to ask the right questions before drawing conclusions.</p>
<p>MBAs are trained in analytical frameworks. Data engineers are trained, by production to distrust the inputs to those frameworks until proven otherwise. In business analysis, that is often the more valuable skill.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="where-mbas-still-have-the-edge-and-data-engineers-should-admit-it">Where MBAs Still Have the Edge (And Data Engineers Should Admit It)<a href="https://www.recodehive.com/blog/data-engineers-vs-mbas#where-mbas-still-have-the-edge-and-data-engineers-should-admit-it" class="hash-link" aria-label="Direct link to Where MBAs Still Have the Edge (And Data Engineers Should Admit It)" title="Direct link to Where MBAs Still Have the Edge (And Data Engineers Should Admit It)" translate="no">​</a></h2>
<p>This argument would be dishonest if it stopped here. MBAs have genuine advantages in business analysis that data engineers tend to lack, and those advantages are not trivial.</p>
<ol>
<li class="">
<p><strong>Stakeholder communication.</strong> The ability to take a complex finding and present it clearly to a non-technical audience, to a CFO, a board, a product team, is a skill that MBA programs drill explicitly and data engineering programs largely ignore. Data engineers frequently know the right answer and communicate it in a way that nobody acts on. That is an analytical failure, even if the analysis is technically correct.</p>
</li>
<li class="">
<p><strong>Business context and domain knowledge.</strong> MBA training includes deliberate exposure to finance, operations, marketing, strategy, and organizational behavior. Data engineers often develop deep expertise in one or two functional areas, wherever their pipelines touch and shallow knowledge everywhere else. A data engineer who works on the payments pipeline knows a lot about transaction data and relatively little about go-to-market strategy. Business analysis requires breadth.</p>
</li>
<li class="">
<p><strong>Framework fluency.</strong> Porter's Five Forces, the BCG matrix, unit economics, customer segmentation frameworks, these are not just jargon. They are shared vocabulary that allows business analysts to communicate efficiently with executives and cross-functional stakeholders. Data engineers who lack this vocabulary can be technically right and organizationally invisible.</p>
</li>
<li class="">
<p><strong>Comfort with ambiguity in the absence of data.</strong> Sometimes there is no data for the decision that needs to be made. Sometimes the best analysis is qualitative, customer interviews, expert judgment, market intuition. MBA training includes frameworks for making decisions under genuine uncertainty. Data engineers can be paralyzed by the absence of clean data, waiting for a complete dataset before forming a view.</p>
</li>
</ol>
<p>The honest summary: data engineers are better at knowing whether the data is trustworthy. MBAs are better at knowing what to do with it once trust is established. The best business analysts do both.</p>
<p><img decoding="async" loading="lazy" alt="&amp;quot;The Ideal Analyst Venn&amp;quot;" src="https://www.recodehive.com/assets/images/ideal-analyst-venn-b39ae650752eae88417d3679b2fedefb.png" width="1536" height="1024" class="img_ev3q"></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-the-best-business-analysts-of-the-next-decade-will-look-like">What the Best Business Analysts of the Next Decade Will Look Like<a href="https://www.recodehive.com/blog/data-engineers-vs-mbas#what-the-best-business-analysts-of-the-next-decade-will-look-like" class="hash-link" aria-label="Direct link to What the Best Business Analysts of the Next Decade Will Look Like" title="Direct link to What the Best Business Analysts of the Next Decade Will Look Like" translate="no">​</a></h2>
<p>The role of business analyst is changing faster than either MBA programs or data engineering teams are adjusting to.</p>
<p>Five years ago, business analysis meant querying data that someone else built and explaining what it showed. Today, it increasingly means understanding the systems that produce the data, the definitions embedded in the pipelines, the quality characteristics of each source, and the engineering cost of the insights that the business is asking for. The data infrastructure is no longer a background condition of business analysis. It is part of the analysis itself.</p>
<p>This shift favors data engineers. Not because MBAs cannot learn the technical side, they can, and many of the best analysts today have done exactly that, but because the default starting point of a data engineer (systems thinking, data skepticism, production-failure experience) is closer to where the role is going than the default starting point of an MBA (framework fluency, stakeholder communication, comfort with abstraction).</p>
<p>The analyst who will be most valuable in 2030 is probably not a data engineer who learned to give better presentations, though that person is useful. It is probably someone who started in data engineering, developed genuine business and communication fluency, and can operate at both layers simultaneously — who can tell you that the CAC number is wrong because of a JOIN condition and also tell you what to do about the marketing budget given the corrected number.</p>
<p>That person is rare. Both communities should be trying to produce more of them.</p>
<p><img decoding="async" loading="lazy" alt="&amp;quot;The Evolution of Business Analysis&amp;quot;" src="https://www.recodehive.com/assets/images/ba-evolution-timeline-15ae554e1c8789f404c989e50bf17d43.png" width="1774" height="887" class="img_ev3q"></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-practical-implication-for-hiring-managers">The Practical Implication for Hiring Managers<a href="https://www.recodehive.com/blog/data-engineers-vs-mbas#the-practical-implication-for-hiring-managers" class="hash-link" aria-label="Direct link to The Practical Implication for Hiring Managers" title="Direct link to The Practical Implication for Hiring Managers" translate="no">​</a></h2>
<p>If you are hiring business analysts and you are only looking at MBA credentials, you are filtering out a large population of candidates who have the specific skills that increasingly define excellent business analysis.</p>
<p>Some questions worth asking of any analyst candidate - MBA or data engineer:</p>
<ul>
<li class="">
<ol>
<li class="">"Walk me through a time when you questioned a metric that the business was relying on. What did you find?"</li>
</ol>
</li>
<li class="">
<ol start="2">
<li class="">"How do you validate that a number in a dashboard is trustworthy before you use it in a recommendation?"</li>
</ol>
</li>
<li class="">
<ol start="3">
<li class="">"Describe a business question you were asked that turned out to be unanswerable with existing data. What did you do?"</li>
</ol>
</li>
<li class="">
<ol start="4">
<li class="">"How would you explain your most complex analysis to someone who has never seen the underlying data?"</li>
</ol>
</li>
</ul>
<p>The first three questions tend to favor data engineers. The fourth tends to favor MBAs. The best candidates handle all four.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-practical-implication-for-data-engineers">The Practical Implication for Data Engineers<a href="https://www.recodehive.com/blog/data-engineers-vs-mbas#the-practical-implication-for-data-engineers" class="hash-link" aria-label="Direct link to The Practical Implication for Data Engineers" title="Direct link to The Practical Implication for Data Engineers" translate="no">​</a></h2>
<p>If you are a data engineer who wants to move into business analysis, the technical credibility is already there. The gap is almost always on the communication and business context side.</p>
<p>Specifically:</p>
<p><strong>Learn to write for non-technical audiences.</strong> Your analysis is only as valuable as the decision it informs. If the decision-maker cannot understand your analysis, it will not inform the decision regardless of how technically correct it is.</p>
<p><strong>Develop opinions about the business, not just the data.</strong> Business analysts are paid to have views, not just to report numbers. Build the habit of ending every analysis with a recommendation, not just a finding.</p>
<p><strong>Learn enough finance to speak the language of decisions.</strong> Unit economics, margin, payback period, ROI, these are not complicated concepts, but they are the vocabulary in which business decisions get made. Fluency costs a few weeks of deliberate study and pays dividends for a career.</p>
<p><strong>Stop waiting for perfect data.</strong> The business will not wait. Learn to state your assumptions explicitly, quantify your uncertainty, and make a recommendation anyway. That is what business analysis actually looks like in practice.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="key-takeaways">Key Takeaways<a href="https://www.recodehive.com/blog/data-engineers-vs-mbas#key-takeaways" class="hash-link" aria-label="Direct link to Key Takeaways" title="Direct link to Key Takeaways" translate="no">​</a></h2>
<p><strong>Data engineers are closer to the truth of the data.</strong> They know where it breaks, when to distrust it, and what the numbers actually measure. This is not a minor advantage in business analysis. It is the foundation the entire role sits on.</p>
<p><strong>MBA training optimizes for communication and framework fluency.</strong> These are real skills that data engineers typically underinvest in. The ability to turn a correct analysis into a decision requires them.</p>
<p><strong>The distinction between "knowing the data is wrong" and "knowing what to do when it's right" maps almost exactly onto the skill gaps between these two communities.</strong> The most valuable analysts close both gaps.</p>
<p><strong>Proximity to data systems is becoming a core business analysis competency.</strong> As data infrastructure complexity grows, the analyst who does not understand the systems producing the data is increasingly at a disadvantage — not just technically, but analytically.</p>
<p><strong>The future belongs to neither camp exclusively.</strong> It belongs to people who take the best of both: the systems thinking and data skepticism of the data engineer, and the communication fluency and business judgment of the MBA. Both communities should be building toward that combination, not defending their territory.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="frequently-asked-questions">Frequently Asked Questions<a href="https://www.recodehive.com/blog/data-engineers-vs-mbas#frequently-asked-questions" class="hash-link" aria-label="Direct link to Frequently Asked Questions" title="Direct link to Frequently Asked Questions" translate="no">​</a></h2>
<p><strong>Q: Isn't this just survivorship bias? You're describing the best data engineers, not the average one.</strong></p>
<p>Fair point. The average data engineer is not a strong business analyst, they may have all the technical instincts described above but communicate them poorly, lack business context, or have no interest in the strategy side. The argument is not that all data engineers are better business analysts. It is that the skills data engineering develops are more directly applicable to modern business analysis than MBA training is, when those skills are coupled with business communication ability.</p>
<p><strong>Q: Do MBAs actually lack technical data skills? Many MBA programs now teach analytics.</strong></p>
<p>MBA programs have added data analytics courses, and some graduates are technically capable. But there is a meaningful difference between coursework in SQL or Tableau and two years of debugging production pipelines. The experiential depth of a working data engineer, the calibration that comes from building systems that fail and fixing them, is not replicable by a semester of coursework.</p>
<p><strong>Q: Isn't the real answer just to hire both and have them work together?</strong></p>
<p>Yes, and many strong organizations do exactly this. But the question is about individual capability in the business analyst role, not team composition. The argument is that a single data engineer with communication skills is often more effective in that role than a single MBA without data infrastructure fluency because the errors that compound silently in business analysis tend to live in the data layer, not the framework layer.</p>
<p><strong>Q: What about domain-specific industries like finance or healthcare where the MBA's domain knowledge is critical?</strong></p>
<p>Domain knowledge matters enormously, and in highly specialized industries, the MBA's domain fluency may outweigh the data engineer's infrastructure intuition. The argument applies most cleanly to tech and data-intensive consumer businesses, where the data infrastructure is the business in a meaningful sense and errors in data systems directly translate into errors in business decisions.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="references-and-further-reading">References and Further Reading<a href="https://www.recodehive.com/blog/data-engineers-vs-mbas#references-and-further-reading" class="hash-link" aria-label="Direct link to References and Further Reading" title="Direct link to References and Further Reading" translate="no">​</a></h2>
<ul>
<li class=""><a href="https://www.montecarlodata.com/state-of-data-quality/" target="_blank" rel="noopener noreferrer" class="">Monte Carlo Data — State of Data Quality 2023</a></li>
<li class=""><a href="https://www.recodehive.com/blog/medallion-architecture" target="_blank" rel="noopener noreferrer" class="">RecodeHive — Medallion Architecture Explained</a></li>
<li class=""><a href="https://www.recodehive.com/blog/batch-vs-stream-processing" target="_blank" rel="noopener noreferrer" class="">RecodeHive — Hidden Cost of Streaming Pipelines</a></li>
<li class=""><a href="https://www.gartner.com/en/documents/6648734" target="_blank" rel="noopener noreferrer" class="">Gartner — Data Quality Market Guide 2024</a></li>
<li class=""><a href="https://www.getdbt.com/analytics-engineering/transformation/" target="_blank" rel="noopener noreferrer" class="">dbt Labs — The Analytics Engineering Handbook</a></li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="about-the-author">About the Author<a href="https://www.recodehive.com/blog/data-engineers-vs-mbas#about-the-author" class="hash-link" aria-label="Direct link to About the Author" title="Direct link to About the Author" translate="no">​</a></h2>
<p><strong>Aditya Singh Rathore</strong> is a Data Engineer focused on building modern, scalable data platforms on Azure. He writes about data engineering, cloud architecture, and real-world pipelines on <a href="https://www.recodehive.com/" target="_blank" rel="noopener noreferrer" class="">RecodeHive</a>, turning hard-won production lessons into content anyone can apply.</p>
<p>🔗 <a href="https://www.linkedin.com/in/aditya-singh-rathore0017/" target="_blank" rel="noopener noreferrer" class="">LinkedIn</a> | <a href="https://github.com/Adez017" target="_blank" rel="noopener noreferrer" class="">GitHub</a></p>
<p>📩 Data engineer who made the move into business analysis? MBA who learned the data engineering side? Drop your story in the comments, the best takes on this come from people who've lived both sides.</p>
<div></div>]]></content:encoded>
            <author>rathoreadityasingh30@gmail.com (Aditya Singh Rathore)</author>
            <category>data-engineering</category>
            <category>career</category>
            <category>business-analysis</category>
            <category>data-driven</category>
            <category>analytics</category>
            <category>data-strategy</category>
            <category>opinion</category>
            <category>leadership</category>
        </item>
        <item>
            <title><![CDATA[How We Used Purview Data Catalog to Reduce Onboarding Time for New Data Engineers from 2 Weeks to 3 Days]]></title>
            <link>https://www.recodehive.com/blog/microsoft-data-purview</link>
            <guid>https://www.recodehive.com/blog/microsoft-data-purview</guid>
            <pubDate>Tue, 19 May 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[New data engineers were spending 2 weeks just figuring out what data existed and who owned it. Here's how we used Microsoft Purview's catalog, lineage graph, and business glossary to cut that to 3 days — with the exact configuration that made it work.]]></description>
            <content:encoded><![CDATA[<p>The ticket came in on a Wednesday. A new data engineer, two weeks into the job had spent four days trying to understand why the <code>customer_lifetime_value</code> column in the Gold layer showed different numbers than the same field in the BI report.</p>
<p>It was not a pipeline bug. The column existed in two places: once in <code>gold.customer_metrics</code> (calculated monthly) and once in <code>gold.customer_ltv_rolling</code> (calculated on a 90-day rolling window). Both were correct. Neither was documented. Nobody had told the new engineer either table existed, let alone the difference between them.</p>
<p>He had been Slacking three different senior engineers to chase down an answer that should have taken five minutes to find independently.</p>
<p>That ticket was the moment we decided to fix onboarding.</p>
<p><strong>What this post covers:</strong></p>
<ul>
<li class="">The exact problem structure that made onboarding slow and why it was invisible to us until we measured it</li>
<li class="">How we used Microsoft Purview's three core capabilities, searchable catalog, lineage visualization, and business glossary with ownership metadata to eliminate the "who do I ask?" loop</li>
<li class="">The configuration steps that actually moved the needle, with real before-and-after numbers for each</li>
<li class="">What we got wrong the first time, and the one thing that made the second attempt stick</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-problem-measured">The Problem, Measured<a href="https://www.recodehive.com/blog/microsoft-data-purview#the-problem-measured" class="hash-link" aria-label="Direct link to The Problem, Measured" title="Direct link to The Problem, Measured" translate="no">​</a></h2>
<p>Before we changed anything, we ran a structured retrospective with four recent hires across different seniority levels. We asked one question: <strong>"In your first two weeks, where did you spend time that you wish you hadn't?"</strong></p>
<p>The answers sorted into three buckets with near-perfect consistency:</p>
<table><thead><tr><th>Time sink</th><th>Avg. hours lost</th><th>Root cause</th></tr></thead><tbody><tr><td>Finding which table to use for a given metric</td><td>18 hrs</td><td>No searchable catalog; tables discovered by asking people</td></tr><tr><td>Understanding upstream dependencies before touching a pipeline</td><td>14 hrs</td><td>No lineage visibility; had to trace JOINs manually through code</td></tr><tr><td>Figuring out who owns a dataset / who to ask about it</td><td>11 hrs</td><td>Ownership lived in people's heads or stale Confluence pages</td></tr><tr><td>Reading existing pipeline code to understand business logic</td><td>9 hrs</td><td>Expected; we accepted this as non-reducible</td></tr></tbody></table>
<p>Total addressable time: <strong>43 hours across the first two weeks.</strong> The fourth bucket, reading code, we treated as irreducible. A new engineer needs to read the code. The first three buckets were pure friction. They produced no learning, only delay.</p>
<p>The target was to get those 43 hours to under 5. That is the difference between a two-week ramp and a three-day one.</p>
<div class="theme-admonition theme-admonition-note admonition_xJq3 alert alert--secondary"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M6.3 5.69a.942.942 0 0 1-.28-.7c0-.28.09-.52.28-.7.19-.18.42-.28.7-.28.28 0 .52.09.7.28.18.19.28.42.28.7 0 .28-.09.52-.28.7a1 1 0 0 1-.7.3c-.28 0-.52-.11-.7-.3zM8 7.99c-.02-.25-.11-.48-.31-.69-.2-.19-.42-.3-.69-.31H6c-.27.02-.48.13-.69.31-.2.2-.3.44-.31.69h1v3c.02.27.11.5.31.69.2.2.42.31.69.31h1c.27 0 .48-.11.69-.31.2-.19.3-.42.31-.69H8V7.98v.01zM7 2.3c-3.14 0-5.7 2.54-5.7 5.68 0 3.14 2.56 5.7 5.7 5.7s5.7-2.55 5.7-5.7c0-3.15-2.56-5.69-5.7-5.69v.01zM7 .98c3.86 0 7 3.14 7 7s-3.14 7-7 7-7-3.12-7-7 3.14-7 7-7z"></path></svg></span>note</div><div class="admonitionContent_BuS1"><p>The 9 hours spent reading pipeline code did not disappear after the Purview rollout. It actually went down slightly, because engineers who understand the lineage before reading the code read it more efficiently. But we did not count on that in our projections.</p></div></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="our-data-estate-before-purview">Our Data Estate Before Purview<a href="https://www.recodehive.com/blog/microsoft-data-purview#our-data-estate-before-purview" class="hash-link" aria-label="Direct link to Our Data Estate Before Purview" title="Direct link to Our Data Estate Before Purview" translate="no">​</a></h2>
<p>To understand what we configured, you need to know what we were working with. The estate was not enormous, but it was complex enough to be disorienting for someone new:</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">Data Sources</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">├── Azure SQL Database (transactional - orders, customers, products)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">├── Kafka → Event Hubs (clickstream, app events)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">├── Third-party REST APIs (marketing attribution, support tickets)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">│</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">ADF Pipelines (ingestion, ~40 pipelines)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">│</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">ADLS Gen2</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">├── bronze/    (raw, partitioned by source and date)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">├── silver/    (cleaned, Delta tables, ~180 tables)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">└── gold/      (aggregated, serving layer, ~60 tables)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">│</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">Azure Synapse Analytics (SQL serving for BI)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">│</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">Power BI (dashboards, ~25 reports)</span><br></div></code></pre></div></div>
<p>240 tables across three layers. 40 ADF pipelines. 25 Power BI reports. No central documentation. New engineers navigated this through a combination of institutional knowledge, Slack archaeology, and luck.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-three-purview-capabilities-that-moved-the-needle">The Three Purview Capabilities That Moved the Needle<a href="https://www.recodehive.com/blog/microsoft-data-purview#the-three-purview-capabilities-that-moved-the-needle" class="hash-link" aria-label="Direct link to The Three Purview Capabilities That Moved the Needle" title="Direct link to The Three Purview Capabilities That Moved the Needle" translate="no">​</a></h2>
<p>We did not use every Purview feature. We used three, in a deliberate order, because each one built on the last.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="capability-1-searchable-data-catalog-week-1-unlock">Capability 1: Searchable Data Catalog (Week 1 unlock)<a href="https://www.recodehive.com/blog/microsoft-data-purview#capability-1-searchable-data-catalog-week-1-unlock" class="hash-link" aria-label="Direct link to Capability 1: Searchable Data Catalog (Week 1 unlock)" title="Direct link to Capability 1: Searchable Data Catalog (Week 1 unlock)" translate="no">​</a></h3>
<p>The first and most urgent problem: new engineers could not find tables without asking someone. The bronze, silver, and gold layers had consistent naming conventions internally, but there was no way to search across all 240 tables by business concept. If you wanted the table behind the "monthly active users" metric, you had to know to look in <code>gold.user_engagement_monthly</code>, a name that is only obvious in retrospect.</p>
<p>Purview's catalog solves this through asset scanning and enrichment. Here is the scanning configuration we used for ADLS Gen2:</p>
<div class="language-json codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-json codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token comment" style="color:#999988;font-style:italic">// Purview Scan Configuration — ADLS Gen2 Silver Layer</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"name"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"silver-layer-full-scan"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"kind"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"AdlsGen2Msi"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"properties"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"scanRulesetName"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"AdlsGen2"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"scanRulesetType"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"System"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"collection"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token property" style="color:#36acaa">"referenceName"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"data-platform-silver"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token property" style="color:#36acaa">"type"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"CollectionReference"</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"dataSourceName"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"adls-prod-silver"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"scanLevel"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"Full"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"fileFormats"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">"Delta"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"Parquet"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"CSV"</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"filter"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token property" style="color:#36acaa">"excludeUriPrefixes"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">"silver/archive/"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"silver/tmp/"</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"schedule"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"recurrence"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token property" style="color:#36acaa">"frequency"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"Week"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token property" style="color:#36acaa">"interval"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">1</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token property" style="color:#36acaa">"startTime"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"2026-01-01T02:00:00Z"</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">}</span><br></div></code></pre></div></div>
<p>Scanning alone gives you asset discovery, Purview registers every table it finds. The second step is enrichment: adding descriptions, classifications, and business tags that make assets searchable by concept rather than just by name.</p>
<p>We built a lightweight enrichment script that ran after each scan and pushed descriptions from our dbt <code>schema.yml</code> files directly into Purview via the Atlas API:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token comment" style="color:#999988;font-style:italic"># enrich_purview_assets.py</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> yaml</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> requests</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">PURVIEW_ENDPOINT </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"https://&lt;your-account&gt;.purview.azure.com"</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">HEADERS </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token string" style="color:#e3116c">"Authorization"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string-interpolation string" style="color:#e3116c">f"Bearer </span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">get_token</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">(</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">)</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c">"</span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">push_descriptions_from_dbt</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">schema_yml_path</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token builtin">str</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">with</span><span class="token plain"> </span><span class="token builtin">open</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">schema_yml_path</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">as</span><span class="token plain"> f</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        schema </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> yaml</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">safe_load</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">f</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">for</span><span class="token plain"> model </span><span class="token keyword" style="color:#00009f">in</span><span class="token plain"> schema</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"models"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        asset_name </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token string-interpolation string" style="color:#e3116c">f"silver.</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">model</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">[</span><span class="token string-interpolation interpolation string" style="color:#e3116c">'name'</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">]</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c">"</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        description </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> model</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"description"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">""</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        tags </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> model</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"meta"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"business_tags"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># Find asset GUID in Purview</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        search_resp </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> requests</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">post</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token string-interpolation string" style="color:#e3116c">f"</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">PURVIEW_ENDPOINT</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c">/catalog/api/search/query"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            headers</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">HEADERS</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            json</span><span class="token operator" style="color:#393A34">=</span><span class="token punctuation" style="color:#393A34">{</span><span class="token string" style="color:#e3116c">"keywords"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> asset_name</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"limit"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">1</span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        guid </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> search_resp</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">json</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">"value"</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">[</span><span class="token number" style="color:#36acaa">0</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">"id"</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># Push description and tags</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        requests</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">put</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token string-interpolation string" style="color:#e3116c">f"</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">PURVIEW_ENDPOINT</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c">/catalog/api/atlas/v2/entity/guid/</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">guid</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c">"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            headers</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">HEADERS</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            json</span><span class="token operator" style="color:#393A34">=</span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token string" style="color:#e3116c">"entity"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                    </span><span class="token string" style="color:#e3116c">"guid"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> guid</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                    </span><span class="token string" style="color:#e3116c">"attributes"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                        </span><span class="token string" style="color:#e3116c">"userDescription"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> description</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                        </span><span class="token string" style="color:#e3116c">"businessAttributes"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token string" style="color:#e3116c">"tags"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> tags</span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                    </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">push_descriptions_from_dbt</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"models/gold/schema.yml"</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
<p>After enrichment, a new engineer searching "monthly active users" in the Purview catalog surface gets <code>gold.user_engagement_monthly</code> as the top result with a description, the columns it contains, who owns it, and when it was last updated.</p>
<p><strong>Before:</strong> 18 hours finding the right table. <strong>After:</strong> under 20 minutes, self-serve.</p>
<p><img decoding="async" loading="lazy" alt="Purview Catalog Search UI" src="https://www.recodehive.com/assets/images/purview-catalog-search-c2a74216b9b5477cd24db54b79902dec.png" width="1693" height="929" class="img_ev3q"></p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="capability-2-lineage-visualization-day-23-unlock">Capability 2: Lineage Visualization (Day 2–3 unlock)<a href="https://www.recodehive.com/blog/microsoft-data-purview#capability-2-lineage-visualization-day-23-unlock" class="hash-link" aria-label="Direct link to Capability 2: Lineage Visualization (Day 2–3 unlock)" title="Direct link to Capability 2: Lineage Visualization (Day 2–3 unlock)" translate="no">​</a></h3>
<p>Finding the right table was the first unlock. Understanding whether it was safe to modify a pipeline that fed into that table was the second.</p>
<p>Before Purview, tracing pipeline dependencies required one of two things: reading every ADF pipeline config file in sequence, or asking a senior engineer to walk you through it. Both took hours. The first was error-prone because pipeline dependency is not always explicit in ADF, a dataset referenced in one pipeline may be consumed by three others with no obvious link in the code.</p>
<p>Purview's lineage graph is populated automatically by the ADF integration. Once connected, every ADF pipeline run registers its source and sink assets in Purview, building a dependency graph that is always current, not a diagram someone drew once and forgot to update.</p>
<p>Setting up the ADF-to-Purview lineage connection is a one-time configuration:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain"># Step 1 - Enable managed identity on your ADF instance</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">az datafactory update \</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  --resource-group rg-data-platform \</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  --factory-name adf-prod \</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  --identity '{"type": "SystemAssigned"}'</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"># Step 2 - Grant ADF's managed identity the Data Curator role in Purview</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">az purview account add-root-collection-admin \</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  --account-name purview-prod \</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  --resource-group rg-data-platform \</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  --object-id $(az datafactory show \</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">      --name adf-prod \</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">      --resource-group rg-data-platform \</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">      --query identity.principalId -o tsv)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"># Step 3 - Connect ADF to Purview (done in ADF Studio UI or via ARM)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"># In ADF Studio: Manage → Microsoft Purview → Connect to Purview account</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"># Select your Purview account. ADF will start reporting lineage on next pipeline run.</span><br></div></code></pre></div></div>
<p>After this, every ADF pipeline run automatically updates the lineage graph. A new engineer who wants to know what feeds <code>gold.customer_metrics</code> opens the asset in Purview, clicks the Lineage tab, and sees the full upstream chain from the source Azure SQL tables, through the ADF copy activity, through the Spark transformation job, to the gold table without asking anyone.</p>
<p>The downstream view is just as valuable. Before touching a silver table, a new engineer can see exactly which gold tables and Power BI reports depend on it. That single capability eliminated the most common new-hire mistake: modifying a table without realizing it breaks a downstream report.</p>
<p><img decoding="async" loading="lazy" alt="Lineage Graph" src="https://www.recodehive.com/assets/images/purview-lineage-graph-7cad5661eba35a090a186575555ac32f.png" width="1774" height="887" class="img_ev3q"></p>
<p><strong>Before:</strong> 14 hours tracing dependencies manually. <strong>After:</strong> under 10 minutes in the lineage tab.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="capability-3-business-glossary--ownership-metadata-the-trust-layer">Capability 3: Business Glossary + Ownership Metadata (The trust layer)<a href="https://www.recodehive.com/blog/microsoft-data-purview#capability-3-business-glossary--ownership-metadata-the-trust-layer" class="hash-link" aria-label="Direct link to Capability 3: Business Glossary + Ownership Metadata (The trust layer)" title="Direct link to Capability 3: Business Glossary + Ownership Metadata (The trust layer)" translate="no">​</a></h3>
<p>The catalog tells you what tables exist. The lineage tells you how they connect. Neither tells you whether the table is the authoritative source for a given metric, who is responsible for it when something breaks, or what the business definition of the columns actually means.</p>
<p>Without that layer, a new engineer who found <code>gold.customer_metrics</code> via the catalog still had to ask: "Is this the one the finance team uses? Is <code>customer_lifetime_value</code> here calculated the same way as in the BI report?"</p>
<p>That is where the business glossary and ownership metadata close the gap.</p>
<p><strong>Business Glossary — defining terms once, everywhere</strong></p>
<p>We created a glossary term for every metric that had more than one implementation or a non-obvious definition. Each term includes the canonical definition, the authoritative table that implements it, and links to any non-authoritative implementations with explanations of how they differ.</p>
<div class="language-json codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-json codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token comment" style="color:#999988;font-style:italic">// Purview Business Glossary Term — via REST API</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"name"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"Customer Lifetime Value"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"shortDescription"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"Predicted net revenue from a customer over their entire relationship with the company."</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"longDescription"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"Calculated as average order value × purchase frequency × average customer lifespan. The canonical implementation uses a 12-month trailing window. See gold.customer_ltv_annual for the authoritative source. Note: gold.customer_ltv_rolling uses a 90-day window for short-term forecasting — do not use for finance reporting."</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"status"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"Approved"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"anchor"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"glossaryGuid"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"&lt;your-glossary-guid&gt;"</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"contacts"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"Expert"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token punctuation" style="color:#393A34">{</span><span class="token property" style="color:#36acaa">"id"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"&lt;priya-aad-object-id&gt;"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token property" style="color:#36acaa">"info"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"Data Platform Lead"</span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"Steward"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token punctuation" style="color:#393A34">{</span><span class="token property" style="color:#36acaa">"id"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"&lt;finance-team-aad-object-id&gt;"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token property" style="color:#36acaa">"info"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"Finance Analytics"</span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"resources"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token property" style="color:#36acaa">"displayName"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"gold.customer_ltv_annual"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token property" style="color:#36acaa">"url"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"https://purview.azure.com/catalog/asset/&lt;guid&gt;"</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">}</span><br></div></code></pre></div></div>
<p>Once a glossary term is created, it gets linked to the relevant table assets in the catalog. When a new engineer opens <code>gold.customer_metrics</code> in Purview, they see the glossary terms linked to each column, clickable definitions that explain what the column means in business terms, not just what its data type is.</p>
<p><strong>Ownership metadata — answering "who do I ask?" before it gets asked</strong></p>
<p>Every asset in Purview can have owners and expert contacts assigned. We built a convention: every gold table has exactly one <strong>owner</strong> (the team responsible for its accuracy) and one <strong>expert</strong> (the engineer who built or most recently maintained it). Silver tables have an expert; ownership is at the domain level.</p>
<p>We enforced this through a weekly scan that flagged unowned assets:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token comment" style="color:#999988;font-style:italic"># check_unowned_assets.py — runs in CI on a weekly schedule</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> requests</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">PURVIEW_ENDPOINT </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"https://&lt;your-account&gt;.purview.azure.com"</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">get_unowned_gold_assets</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    resp </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> requests</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">post</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string-interpolation string" style="color:#e3116c">f"</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">PURVIEW_ENDPOINT</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c">/catalog/api/search/query"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        headers</span><span class="token operator" style="color:#393A34">=</span><span class="token punctuation" style="color:#393A34">{</span><span class="token string" style="color:#e3116c">"Authorization"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string-interpolation string" style="color:#e3116c">f"Bearer </span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">get_token</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">(</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">)</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c">"</span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        json</span><span class="token operator" style="color:#393A34">=</span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token string" style="color:#e3116c">"keywords"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"*"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token string" style="color:#e3116c">"limit"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">1000</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token string" style="color:#e3116c">"filter"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token string" style="color:#e3116c">"and"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                    </span><span class="token punctuation" style="color:#393A34">{</span><span class="token string" style="color:#e3116c">"collectionId"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"data-platform-gold"</span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                    </span><span class="token punctuation" style="color:#393A34">{</span><span class="token string" style="color:#e3116c">"not"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token string" style="color:#e3116c">"attributeName"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"contacts"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"operator"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"contains"</span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    unowned </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain">a</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">"qualifiedName"</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">for</span><span class="token plain"> a </span><span class="token keyword" style="color:#00009f">in</span><span class="token plain"> resp</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">json</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">"value"</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> unowned</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># Post to Slack #data-platform-alerts</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        post_slack_alert</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token string-interpolation string" style="color:#e3116c">f":warning: </span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation builtin">len</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">(</span><span class="token string-interpolation interpolation">unowned</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">)</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c"> gold assets have no owner in Purview:\n"</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token operator" style="color:#393A34">+</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"\n"</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">join</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string-interpolation string" style="color:#e3116c">f"• `</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">name</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c">`"</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">for</span><span class="token plain"> name </span><span class="token keyword" style="color:#00009f">in</span><span class="token plain"> unowned</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">get_unowned_gold_assets</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
<p>Within three weeks of running this check, ownership coverage on gold tables went from 34% to 97%.</p>
<p><strong>Before:</strong> 11 hours chasing ownership through Slack. <strong>After:</strong> under 5 minutes, open the asset, click the owner contact, done.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-the-onboarding-experience-looks-like-now">What the Onboarding Experience Looks Like Now<a href="https://www.recodehive.com/blog/microsoft-data-purview#what-the-onboarding-experience-looks-like-now" class="hash-link" aria-label="Direct link to What the Onboarding Experience Looks Like Now" title="Direct link to What the Onboarding Experience Looks Like Now" translate="no">​</a></h2>
<p>The best way to show what changed is to walk through Day 1 of onboarding as it exists today, compared to before.</p>
<p><strong>Before Purview — Day 1 task: "Understand how we calculate Monthly Active Users"</strong></p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">9:00am  — Assigned the task. Open the Gold layer ADLS container.</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">9:15am  — 60 tables in gold/. No README. Start reading table names.</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">9:45am  — Find two plausible tables: user_engagement_monthly, user_activity_agg.</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">10:00am — Slack senior engineer: "Which one is canonical?"</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">          No reply until 2pm (engineer in meetings).</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">2:05pm  — Directed to user_engagement_monthly. Ask what feeds it.</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">2:10pm  — Told to look at the ADF pipeline. Which one? "Search for 'user'."</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">2:30pm  — Found three pipelines with 'user' in the name.</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">3:00pm  — Slack a different engineer to confirm the right pipeline.</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">4:00pm  — Reply received. Correct pipeline confirmed.</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">4:15pm  — Start reading pipeline code.</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">End of day: 7+ hours. Task: still not understood end-to-end.</span><br></div></code></pre></div></div>
<p><strong>After Purview — Day 1, same task</strong></p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">9:00am  — Assigned the task.</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">9:03am  — Search "monthly active users" in Purview catalog.</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">9:04am  — gold.user_engagement_monthly appears as top result,</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">           tagged BI-certified, owner: Data Platform team.</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">9:05am  — Click Lineage tab. Full upstream graph visible:</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">           app_events (Event Hub) → ADF ingest_clickstream</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">           → silver.app_events_cleaned</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">           → ADF transform_user_engagement</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">           → gold.user_engagement_monthly</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">           → Power BI: Product Dashboard</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">9:10am  — Click business glossary link on 'active_days' column.</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">           Definition: "Number of days in the month the user</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">           triggered at least one non-background app_event."</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">9:15am  — Open ADF transform_user_engagement directly from lineage.</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">9:15am — Start reading pipeline code with full context already loaded.</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">End of morning: task understood end-to-end. No Slack messages sent.</span><br></div></code></pre></div></div>
<p>The 7+ hours collapsed to 15 minutes of navigation. The rest of the day is actual work.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-numbers-before-and-after">The Numbers, Before and After<a href="https://www.recodehive.com/blog/microsoft-data-purview#the-numbers-before-and-after" class="hash-link" aria-label="Direct link to The Numbers, Before and After" title="Direct link to The Numbers, Before and After" translate="no">​</a></h2>
<p>After running the updated onboarding process with six new engineers over the following quarter, here is what we measured:</p>
<table><thead><tr><th>Metric</th><th>Before Purview</th><th>After Purview</th><th>Change</th></tr></thead><tbody><tr><td>Median time to first independent PR</td><td>14 days</td><td>4 days</td><td>−71%</td></tr><tr><td>Hours lost to "who owns this?" questions</td><td>11 hrs</td><td>0.5 hrs</td><td>−95%</td></tr><tr><td>Hours lost to table discovery</td><td>18 hrs</td><td>0.8 hrs</td><td>−96%</td></tr><tr><td>Hours lost to lineage tracing</td><td>14 hrs</td><td>1.2 hrs</td><td>−91%</td></tr><tr><td>Senior engineer interruptions per new hire (first 2 weeks)</td><td>23</td><td>4</td><td>−83%</td></tr><tr><td>New hire satisfaction score (onboarding survey, /10)</td><td>5.8</td><td>8.6</td><td>+48%</td></tr></tbody></table>
<p>The senior engineer interruption number deserves attention. Those 23 interruptions per new hire are not free. Each one costs the senior engineer 10–20 minutes of context-switching, and they compound — a team onboarding three engineers simultaneously was absorbing 60+ interruptions per two-week cycle. Purview did not eliminate senior engineer involvement in onboarding. It focused it on the things that actually require human judgment: code review, architectural decisions, domain nuance. The navigational questions — where is this, who owns that, what does this column mean, disappeared almost entirely.</p>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>tip</div><div class="admonitionContent_BuS1"><p>Track senior engineer interruption count as an onboarding KPI, not just new hire ramp time. It is a more honest measurement because it captures the full organizational cost of a slow onboarding, not just the cost to the new hire.</p></div></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-we-got-wrong-the-first-time">What We Got Wrong the First Time<a href="https://www.recodehive.com/blog/microsoft-data-purview#what-we-got-wrong-the-first-time" class="hash-link" aria-label="Direct link to What We Got Wrong the First Time" title="Direct link to What We Got Wrong the First Time" translate="no">​</a></h2>
<p>The first Purview rollout, six months before the one described above, failed quietly. We scanned the assets, registered the tables, and told people it was there. Adoption was near zero. New engineers still asked on Slack.</p>
<p>Three things went wrong:</p>
<p><strong>We scanned but did not enrich.</strong> A catalog of 240 tables with no descriptions, no tags, and no business terminology is no more useful than the storage container it reflects. The scan gives you the skeleton. The enrichment, descriptions from dbt schema files, business tags, glossary links, gives it meaning. We skipped the enrichment step and produced a very expensive directory listing.</p>
<p><strong>We did not integrate it into the onboarding checklist.</strong> Purview existed as a tool, but new engineers were not told to use it on Day 1. The instinct to Slack a senior engineer is faster than the instinct to open a new tool, until the new tool is explicitly in the workflow. The fix was simple: the onboarding checklist now has "Complete the Purview orientation module" as item 3, before any pipeline work begins.</p>
<p><strong>Ownership coverage was too low to be trusted.</strong> When 66% of assets have no listed owner, the catalog trains engineers to distrust it. They look up a table, see no owner, assume the catalog is incomplete, and go back to Slack. Ownership coverage is a prerequisite for trust. We got coverage to 97% before re-launching and we now enforce it with the weekly automated check described above.</p>
<div class="theme-admonition theme-admonition-warning admonition_xJq3 alert alert--warning"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 16 16"><path fill-rule="evenodd" d="M8.893 1.5c-.183-.31-.52-.5-.887-.5s-.703.19-.886.5L.138 13.499a.98.98 0 0 0 0 1.001c.193.31.53.501.886.501h13.964c.367 0 .704-.19.877-.5a1.03 1.03 0 0 0 .01-1.002L8.893 1.5zm.133 11.497H6.987v-2.003h2.039v2.003zm0-3.004H6.987V5.987h2.039v4.006z"></path></svg></span>warning</div><div class="admonitionContent_BuS1"><p>Do not announce your data catalog until ownership coverage is above 80% and descriptions are populated on all production-facing assets. A sparse catalog is worse than no catalog, it trains users to distrust the tool before it has a chance to prove its value.</p></div></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-configuration-checklist">The Configuration Checklist<a href="https://www.recodehive.com/blog/microsoft-data-purview#the-configuration-checklist" class="hash-link" aria-label="Direct link to The Configuration Checklist" title="Direct link to The Configuration Checklist" translate="no">​</a></h2>
<p>If you are setting this up from scratch, here is the exact sequence that worked for us:</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">Phase 1 — Foundation (Week 1–2)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  ✓ Create Purview account, configure collections by data domain</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  ✓ Set up managed identity, grant least-privilege access to data sources</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  ✓ Configure ADLS Gen2 scan (bronze, silver, gold separately)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  ✓ Connect Azure SQL source scan</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  ✓ Run first full scan — verify asset count matches expectation</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  ✓ Connect ADF to Purview for automatic lineage reporting</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">Phase 2 — Enrichment (Week 3–4)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  ✓ Build enrichment script to push dbt descriptions → Purview via Atlas API</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  ✓ Create business glossary terms for all metrics used in BI/finance reporting</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  ✓ Link glossary terms to table assets and column-level metadata</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  ✓ Run first ownership audit — assign owners and experts to all gold assets</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">Phase 3 — Operationalization (Week 5–6)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  ✓ Add Purview orientation to new hire onboarding checklist (Day 1, item 3)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  ✓ Set up weekly scan schedule</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  ✓ Deploy automated unowned-asset check with Slack alerting</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  ✓ Create onboarding walkthrough doc: "Your First 3 Tasks in Purview"</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  ✓ Add Purview asset links to dbt model docs for cross-reference</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">Phase 4 — Measurement (Ongoing)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  ✓ Track: senior engineer interruptions per new hire (first 2 weeks)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  ✓ Track: time to first independent PR</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  ✓ Track: Purview search volume (proxy for adoption)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  ✓ Track: ownership coverage % (target: &gt;95% on gold, &gt;80% on silver)</span><br></div></code></pre></div></div>
<p><img decoding="async" loading="lazy" alt="Configuration Phases Timeline" src="https://www.recodehive.com/assets/images/purview-setup-phases-bf0e3ed0cd8ee6c703186c62dd35d3ad.png" width="1774" height="887" class="img_ev3q"></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="before-you-start-what-purview-cannot-do">Before You Start: What Purview Cannot Do<a href="https://www.recodehive.com/blog/microsoft-data-purview#before-you-start-what-purview-cannot-do" class="hash-link" aria-label="Direct link to Before You Start: What Purview Cannot Do" title="Direct link to Before You Start: What Purview Cannot Do" translate="no">​</a></h2>
<p>Purview is not a documentation system. It is a metadata system. The distinction matters.</p>
<p>If your tables have no documentation anywhere, no dbt descriptions, no wiki pages, no comments in DDL, Purview will faithfully catalog their absence. It will find your tables, register their schemas, and draw their lineage. But it cannot infer what <code>col_flg_v2_final</code> means. That knowledge has to come from somewhere, and if it only exists in someone's head, Purview cannot surface it.</p>
<p>The enrichment step, pushing dbt descriptions into Purview via the Atlas API, works because we already had descriptions in our dbt schema files. If you don't have that, the sequencing changes: write the documentation first, then enrich the catalog. Purview is an amplifier, not a generator.</p>
<p>It also does not replace code review. New engineers still need to read pipeline code. Purview makes them read it with context they know what the table is for, who owns it, what feeds it, which makes them read it faster and understand it more deeply. But it does not replace the code review cycle, the mentorship conversations, or the domain knowledge that only transfers through working together on real problems.</p>
<p>What it replaces is navigational friction. The questions that had no business being asked of a human in the first place.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="key-takeaways">Key Takeaways<a href="https://www.recodehive.com/blog/microsoft-data-purview#key-takeaways" class="hash-link" aria-label="Direct link to Key Takeaways" title="Direct link to Key Takeaways" translate="no">​</a></h2>
<p><strong>The two-week onboarding problem is a discoverability problem, not a complexity problem.</strong> New engineers are not slow because the data estate is hard to understand. They are slow because they cannot find what they are looking for without interrupting someone who built it.</p>
<p><strong>Catalog + lineage + ownership is the minimum viable combination.</strong> Each one alone is insufficient. Catalog without lineage tells you what exists but not how it connects. Lineage without catalog gives you a graph you cannot search. Both without ownership tell you the map but not who to call when the territory changes.</p>
<p><strong>Ownership coverage is a prerequisite, not a nice-to-have.</strong> A catalog with 34% ownership coverage teaches users to distrust the tool. Get coverage above 80% before announcing the rollout, and enforce it with automated checks after.</p>
<p><strong>Enrich before you announce.</strong> A scanned-but-unenriched catalog is an expensive directory listing. The descriptions, glossary links, and business tags are what make it a tool rather than a report.</p>
<p><strong>Measure the right thing.</strong> Time-to-first-PR is a lagging indicator. Senior engineer interruptions per new hire is the leading indicator, it tells you whether the catalog is actually being used before you see it in ramp time data.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="frequently-asked-questions">Frequently Asked Questions<a href="https://www.recodehive.com/blog/microsoft-data-purview#frequently-asked-questions" class="hash-link" aria-label="Direct link to Frequently Asked Questions" title="Direct link to Frequently Asked Questions" translate="no">​</a></h2>
<p><strong>Q: We use Databricks Unity Catalog, not Microsoft Purview. Does this apply?</strong></p>
<p>Most of the strategy applies, searchable catalog, lineage, ownership metadata, glossary but the implementation details differ significantly. Unity Catalog has native lineage for Databricks workloads, which is actually more automatic than Purview's ADF integration for Spark-heavy estates. The enrichment and ownership enforcement logic would be similar in principle, different in API. A follow-up post on Unity Catalog onboarding is in the works.</p>
<p><strong>Q: How long does the initial scan take on a large estate?</strong></p>
<p>Our 240-table scan completed in under 40 minutes. For larger estates (1,000+ assets), expect the first full scan to take 2–4 hours. Incremental scans after the first run are significantly faster, typically under 15 minutes for daily delta.</p>
<p><strong>Q: Do you use Purview's sensitivity labels and data classification?</strong></p>
<p>We do, but that is a governance and compliance story more than an onboarding story. Classification runs automatically as part of each scan and tags columns containing PII, financial data, or health information. New engineers see the classification labels on columns before writing any query, which is useful for access management training but was not a significant contributor to the onboarding time reduction.</p>
<p><strong>Q: What does Purview cost, and was the ROI clear?</strong></p>
<p>Purview pricing is based on data map capacity units and scan compute. For our estate (~250 assets, weekly scans), cost runs roughly $180–220/month. The ROI calculation is straightforward: a single senior engineer's hourly cost, multiplied by 23 interruptions per new hire at 15 minutes each, is about $345 per onboarding cycle at standard engineering rates. Purview paid for itself before the second new hire finished their first week.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="references-and-further-reading">References and Further Reading<a href="https://www.recodehive.com/blog/microsoft-data-purview#references-and-further-reading" class="hash-link" aria-label="Direct link to References and Further Reading" title="Direct link to References and Further Reading" translate="no">​</a></h2>
<ul>
<li class=""><a href="https://learn.microsoft.com/en-us/purview/unified-catalog" target="_blank" rel="noopener noreferrer" class="">Microsoft Purview Documentation - Data Catalog</a></li>
<li class=""><a href="https://learn.microsoft.com/en-us/rest/api/purview/" target="_blank" rel="noopener noreferrer" class="">Microsoft Purview - REST API Reference</a></li>
<li class=""><a href="https://www.recodehive.com/blog/azure-cost-optimization" target="_blank" rel="noopener noreferrer" class="">RecodeHive - Azure Data Pipeline Cost Optimization</a></li>
<li class=""><a href="https://www.recodehive.com/blog/medallion-architecture" target="_blank" rel="noopener noreferrer" class="">RecodeHive - Medallion Architecture Explained</a></li>
<li class=""><a href="https://docs.getdbt.com/reference/resource-properties/description" target="_blank" rel="noopener noreferrer" class="">dbt Docs - schema.yml and model descriptions</a></li>
<li class=""><a href="https://www.montecarlodata.com/state-of-data-quality/" target="_blank" rel="noopener noreferrer" class="">Monte Carlo Data - State of Data Quality 2023</a></li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="about-the-author">About the Author<a href="https://www.recodehive.com/blog/microsoft-data-purview#about-the-author" class="hash-link" aria-label="Direct link to About the Author" title="Direct link to About the Author" translate="no">​</a></h2>
<p><strong>Aditya Singh Rathore</strong> is a Data Engineer focused on building modern, scalable data platforms on Azure. He writes about data engineering, cloud architecture, and real-world pipelines on <a href="https://www.recodehive.com/" target="_blank" rel="noopener noreferrer" class="">RecodeHive</a>, turning hard-won production lessons into content anyone can apply.</p>
<p>🔗 <a href="https://www.linkedin.com/in/aditya-singh-rathore0017/" target="_blank" rel="noopener noreferrer" class="">LinkedIn</a> | <a href="https://github.com/Adez017" target="_blank" rel="noopener noreferrer" class="">GitHub</a></p>
<p>📩 Running Purview in your org? Or struggling with a data catalog rollout that didn't stick? Drop your experience in the comments, the most useful part of these posts is always what the comments surface that the post missed.</p>
<div></div>]]></content:encoded>
            <author>rathoreadityasingh30@gmail.com (Aditya Singh Rathore)</author>
            <category>azure</category>
            <category>microsoft-purview</category>
            <category>data-catalog</category>
            <category>data-governance</category>
            <category>onboarding</category>
            <category>data-lineage</category>
            <category>metadata</category>
            <category>data-engineering</category>
            <category>azure-data-factory</category>
            <category>delta-lake</category>
        </item>
        <item>
            <title><![CDATA[PySpark Optimization Techniques: 6 Mistakes That Slow Down Every Beginner's Pipeline]]></title>
            <link>https://www.recodehive.com/blog/spark-performance-optimizations</link>
            <guid>https://www.recodehive.com/blog/spark-performance-optimizations</guid>
            <pubDate>Sat, 16 May 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Six PySpark mistakes that silently kill pipeline performance and how to fix every one of them. Covering partitioning, shuffle tuning, caching, join strategies, UDFs, predicate pushdown, and cluster config. Explained from scratch.]]></description>
            <content:encoded><![CDATA[<p>The job ran for four hours. It processed 8GB of data.</p>
<p>A file copy of that same 8GB on the same machine would have taken about 45 seconds.</p>
<p>That gap between what Spark <em>could</em> do and what it actually did was entirely self-inflicted. Not because the logic was wrong. The output was correct. But six decisions that seemed harmless at the time were quietly multiplying the runtime: a Python UDF where a built-in function existed, a join that shuffled 200 million rows when it didn't have to, a read that scanned 90 days of data to find yesterday's records.</p>
<p>This post is about those six decisions. Each one is a pattern that beginners hit constantly, not because they're careless, but because PySpark doesn't stop you. It runs the slow version just as willingly as the fast one. You only find out at 3am when the SLA is missed.</p>
<p><strong>What you'll learn in this post:</strong></p>
<ul>
<li class="">Why too many shuffle partitions is just as bad as too few and how to pick the right number</li>
<li class="">How caching works under the hood, and when it actively hurts performance</li>
<li class="">The three join strategies Spark supports and exactly when to use each one</li>
<li class="">Why Python UDFs are a performance trap and what to use instead</li>
<li class="">How predicate pushdown and column pruning reduce data read before any Spark code runs</li>
<li class="">How to size executors so you stop leaving half your cluster idle</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-pipeline-well-optimize">The Pipeline We'll Optimize<a href="https://www.recodehive.com/blog/spark-performance-optimizations#the-pipeline-well-optimize" class="hash-link" aria-label="Direct link to The Pipeline We'll Optimize" title="Direct link to The Pipeline We'll Optimize" translate="no">​</a></h2>
<p>Every example in this post uses a single, realistic pipeline so you can see how the fixes interact with each other:</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">Raw sales events (JSON, S3/ADLS)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    ↓</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">Bronze Delta table (~200M rows, partitioned by event_date)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    ↓</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">PySpark transformation job</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    ↓</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">Silver Delta table (deduplicated, enriched, typed)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    ↓</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">Reporting aggregation</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    ↓</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">Gold Delta table (daily summaries)</span><br></div></code></pre></div></div>
<p>Before any optimization: <strong>4 hours 12 minutes</strong> end-to-end on a 4-node cluster.<br>
<!-- -->After all six fixes: <strong>34 minutes</strong> on the same cluster.</p>
<p>Let's go through every mistake.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="mistake-1-wrong-number-of-shuffle-partitions">Mistake #1: Wrong Number of Shuffle Partitions<a href="https://www.recodehive.com/blog/spark-performance-optimizations#mistake-1-wrong-number-of-shuffle-partitions" class="hash-link" aria-label="Direct link to Mistake #1: Wrong Number of Shuffle Partitions" title="Direct link to Mistake #1: Wrong Number of Shuffle Partitions" translate="no">​</a></h2>
<p><strong>Time lost to this mistake: ~55 minutes</strong></p>
<p>Every time Spark needs to reorganize data across the network after a <code>groupBy</code>, a <code>join</code>, or a <code>distinct</code>, it performs a <strong>shuffle</strong>. The shuffled data is split into partitions, and the number of those partitions is controlled by one setting:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">spark</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">conf</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">get</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"spark.sql.shuffle.partitions"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># Default: "200"</span><br></div></code></pre></div></div>
<p>The default is 200. That number made sense for the era of multi-TB Hadoop clusters it was designed for. On our 8GB pipeline, it created a different problem entirely: 200 tasks launched, each assigned a few megabytes of data, spending more time on Spark's task scheduling machinery than on the actual groupBy computation. The cluster looked busy. The progress bar moved. But most of what was happening was overhead, not work.</p>
<p>The inverse bites you just as hard in the other direction. Too few partitions on a genuinely large dataset means each task takes on more data than it can hold in memory and starts spilling to disk — and disk spills are catastrophically slow compared to in-memory operations.</p>
<p><img decoding="async" loading="lazy" alt="visual explaining shuffle partitioning" src="https://www.recodehive.com/assets/images/01_suffle-9382352c3be088e983e045d683538ce1.png" width="1774" height="887" class="img_ev3q"></p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="understanding-partitions-first">Understanding Partitions First<a href="https://www.recodehive.com/blog/spark-performance-optimizations#understanding-partitions-first" class="hash-link" aria-label="Direct link to Understanding Partitions First" title="Direct link to Understanding Partitions First" translate="no">​</a></h3>
<p>Think of shuffle partitions like checkout lanes at a supermarket. If you open 200 lanes for 64 customers, each cashier handles one customer and then sits idle, you're paying for 136 empty lanes. Open 4 lanes for 200 customers and you get a queue that never moves. The goal is matching lane count to the actual number of customers, not picking a number that sounds safe.</p>
<p>In Spark terms: target <strong>128MB to 200MB of data per partition</strong> after a shuffle.</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">Ideal partitions = Total data size after shuffle ÷ 128MB</span><br></div></code></pre></div></div>
<p>For our 8GB transformation job (data after the join/groupBy, not raw input):</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">Ideal partitions = 8,000MB ÷ 128MB ≈ 63</span><br></div></code></pre></div></div>
<p>We round to a clean number, 64 and set it before any transformation runs.</p>
<div class="theme-tabs-container tabs-container tabList__CuJ"><ul role="tablist" aria-orientation="horizontal" class="tabs"><li role="tab" tabindex="0" aria-selected="true" class="tabs__item tabItem_LNqP tabs__item--active">Before</li><li role="tab" tabindex="-1" aria-selected="false" class="tabs__item tabItem_LNqP">After</li><li role="tab" tabindex="-1" aria-selected="false" class="tabs__item tabItem_LNqP">Impact</li></ul><div class="margin-top--md"><div role="tabpanel" class="tabItem_Ymn6"><div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockTitle_OeMC">transformation-before.py</div><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token comment" style="color:#999988;font-style:italic"># Default: 200 shuffle partitions</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># On 8GB of data, each partition is ~40MB</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># 200 tasks scheduled, most doing trivial work</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># Task launch overhead dominates actual compute time</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">df </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    spark</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">read</span><span class="token punctuation" style="color:#393A34">.</span><span class="token builtin">format</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"delta"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">load</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"abfss://data@lake.dfs.core.windows.net/bronze/sales/"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">groupBy</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"product_id"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"event_date"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">agg</span><span class="token punctuation" style="color:#393A34">(</span><span class="token builtin">sum</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"revenue"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">alias</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"total_revenue"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div></div><div role="tabpanel" class="tabItem_Ymn6" hidden=""><div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockTitle_OeMC">transformation-after.py</div><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token comment" style="color:#999988;font-style:italic"># Set BEFORE any transformations run</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># Rule of thumb: Total shuffle data size ÷ 128MB</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># For ~8GB post-shuffle data: 8000 ÷ 128 ≈ 64</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">spark</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">conf</span><span class="token punctuation" style="color:#393A34">.</span><span class="token builtin">set</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"spark.sql.shuffle.partitions"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"64"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">df </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    spark</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">read</span><span class="token punctuation" style="color:#393A34">.</span><span class="token builtin">format</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"delta"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">load</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"abfss://data@lake.dfs.core.windows.net/bronze/sales/"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">groupBy</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"product_id"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"event_date"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">agg</span><span class="token punctuation" style="color:#393A34">(</span><span class="token builtin">sum</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"revenue"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">alias</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"total_revenue"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div></div><div role="tabpanel" class="tabItem_Ymn6" hidden=""><table><thead><tr><th>Setting</th><th>Shuffle Partitions</th><th>Stage Duration</th><th>Tasks Launched</th></tr></thead><tbody><tr><td>Default (200)</td><td>200</td><td>48 min</td><td>200</td></tr><tr><td>Tuned (64)</td><td>64</td><td>11 min</td><td>64</td></tr></tbody></table></div></div></div>
<p><strong>Result:</strong> Aggregation stage dropped from 48 minutes to 11 minutes. <strong>~37 minutes saved.</strong></p>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>tip</div><div class="admonitionContent_BuS1"><p>For Databricks on Delta Lake, you can also enable <strong>Adaptive Query Execution (AQE)</strong>, which automatically adjusts shuffle partitions at runtime based on actual data size:</p><div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">spark</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">conf</span><span class="token punctuation" style="color:#393A34">.</span><span class="token builtin">set</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"spark.sql.adaptive.enabled"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"true"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">spark</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">conf</span><span class="token punctuation" style="color:#393A34">.</span><span class="token builtin">set</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"spark.sql.adaptive.coalescePartitions.enabled"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"true"</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div><p>AQE doesn't replace manual tuning but it acts as a safety net when your estimate is off. We run both: manual tuning as the primary setting, AQE as the fallback.</p></div></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="mistake-2-caching-everything-or-nothing">Mistake #2: Caching Everything (Or Nothing)<a href="https://www.recodehive.com/blog/spark-performance-optimizations#mistake-2-caching-everything-or-nothing" class="hash-link" aria-label="Direct link to Mistake #2: Caching Everything (Or Nothing)" title="Direct link to Mistake #2: Caching Everything (Or Nothing)" translate="no">​</a></h2>
<p><strong>Time lost to this mistake: ~28 minutes</strong></p>
<p>Caching is one of the most misunderstood features in PySpark. Beginners either avoid it entirely (paying to recompute the same DataFrame multiple times) or cache everything (consuming all available memory and forcing everything else to spill to disk).</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-caching-actually-does">What Caching Actually Does<a href="https://www.recodehive.com/blog/spark-performance-optimizations#what-caching-actually-does" class="hash-link" aria-label="Direct link to What Caching Actually Does" title="Direct link to What Caching Actually Does" translate="no">​</a></h3>
<p>Calling <code>.cache()</code> on a DataFrame doesn't immediately store anything, Spark is lazy, so nothing happens until an action triggers computation. What <code>.cache()</code> actually does is plant a flag that says: <em>the first time you compute this, hold onto the result.</em> The next time something references this DataFrame, Spark reads from that stored result instead of re-running the entire computation from scratch.</p>
<p>The reason this matters is that Spark has no implicit memory of previous computations. Without caching, every action that references <code>base_df</code> starts from the beginning, re-reading the source files, re-running the joins, re-applying the filters. We discovered this the painful way when a pipeline that looked like one job was actually running the most expensive stage twice, adding 28 minutes to every run.</p>
<p>This only helps if you reference the same DataFrame more than once. If you compute a DataFrame, transform it once, and write it, caching adds overhead with zero benefit.</p>
<p><img decoding="async" loading="lazy" alt="Visual explaining caching in a analogy" src="https://www.recodehive.com/assets/images/caching_pipeline-860c3a370f3e92bb0e597ff11ad0aeed.png" width="1774" height="887" class="img_ev3q"></p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token comment" style="color:#999988;font-style:italic"># This caching is useless — df is only used once</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">df </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> spark</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">read</span><span class="token punctuation" style="color:#393A34">.</span><span class="token builtin">format</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"delta"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">load</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">silver_path</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">cache</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">df</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">write</span><span class="token punctuation" style="color:#393A34">.</span><span class="token builtin">format</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"delta"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">save</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">gold_path</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
<p>The right time to cache is when a DataFrame is expensive to compute <em>and</em> you reference it in multiple downstream operations.</p>
<div class="theme-tabs-container tabs-container tabList__CuJ"><ul role="tablist" aria-orientation="horizontal" class="tabs"><li role="tab" tabindex="0" aria-selected="true" class="tabs__item tabItem_LNqP tabs__item--active">Before (No Cache)</li><li role="tab" tabindex="-1" aria-selected="false" class="tabs__item tabItem_LNqP">After (Targeted Cache)</li></ul><div class="margin-top--md"><div role="tabpanel" class="tabItem_Ymn6"><div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockTitle_OeMC">pipeline-no-cache.py</div><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token comment" style="color:#999988;font-style:italic"># base_df is computed TWICE — once for each write</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># Spark re-reads and re-joins from scratch each time</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">base_df </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    spark</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">read</span><span class="token punctuation" style="color:#393A34">.</span><span class="token builtin">format</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"delta"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">load</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">bronze_path</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">join</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">products_df</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"product_id"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"left"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">.</span><span class="token builtin">filter</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">col</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"event_date"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">==</span><span class="token plain"> yesterday</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># First action — triggers full computation of base_df</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">base_df</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">write</span><span class="token punctuation" style="color:#393A34">.</span><span class="token builtin">format</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"delta"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">mode</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"append"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">save</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">silver_path</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># Second action — triggers FULL recomputation of base_df again</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">base_df</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">groupBy</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"category"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">agg</span><span class="token punctuation" style="color:#393A34">(</span><span class="token builtin">sum</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"revenue"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">write</span><span class="token punctuation" style="color:#393A34">.</span><span class="token builtin">format</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"delta"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">save</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">gold_path</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div></div><div role="tabpanel" class="tabItem_Ymn6" hidden=""><div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockTitle_OeMC">pipeline-with-cache.py</div><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> pyspark </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> StorageLevel</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">base_df </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    spark</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">read</span><span class="token punctuation" style="color:#393A34">.</span><span class="token builtin">format</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"delta"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">load</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">bronze_path</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">join</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">products_df</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"product_id"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"left"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">.</span><span class="token builtin">filter</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">col</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"event_date"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">==</span><span class="token plain"> yesterday</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># Cache BEFORE the first action — base_df is used twice</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># MEMORY_AND_DISK: spills to disk if memory is full (safer than MEMORY_ONLY)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">base_df</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">persist</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">StorageLevel</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">MEMORY_AND_DISK</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># First use — computes and stores base_df</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">base_df</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">write</span><span class="token punctuation" style="color:#393A34">.</span><span class="token builtin">format</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"delta"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">mode</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"append"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">save</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">silver_path</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># Second use — reads from cache, no recomputation</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">base_df</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">groupBy</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"category"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">agg</span><span class="token punctuation" style="color:#393A34">(</span><span class="token builtin">sum</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"revenue"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">write</span><span class="token punctuation" style="color:#393A34">.</span><span class="token builtin">format</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"delta"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">save</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">gold_path</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># Always unpersist when done — frees executor memory for the next stage</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">base_df</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">unpersist</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div></div></div></div>
<div class="theme-admonition theme-admonition-note admonition_xJq3 alert alert--secondary"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M6.3 5.69a.942.942 0 0 1-.28-.7c0-.28.09-.52.28-.7.19-.18.42-.28.7-.28.28 0 .52.09.7.28.18.19.28.42.28.7 0 .28-.09.52-.28.7a1 1 0 0 1-.7.3c-.28 0-.52-.11-.7-.3zM8 7.99c-.02-.25-.11-.48-.31-.69-.2-.19-.42-.3-.69-.31H6c-.27.02-.48.13-.69.31-.2.2-.3.44-.31.69h1v3c.02.27.11.5.31.69.2.2.42.31.69.31h1c.27 0 .48-.11.69-.31.2-.19.3-.42.31-.69H8V7.98v.01zM7 2.3c-3.14 0-5.7 2.54-5.7 5.68 0 3.14 2.56 5.7 5.7 5.7s5.7-2.55 5.7-5.7c0-3.15-2.56-5.69-5.7-5.69v.01zM7 .98c3.86 0 7 3.14 7 7s-3.14 7-7 7-7-3.12-7-7 3.14-7 7-7z"></path></svg></span>note</div><div class="admonitionContent_BuS1"><p>We always use <code>MEMORY_AND_DISK</code> rather than <code>MEMORY_ONLY</code>. The reason: when memory fills up, <code>MEMORY_ONLY</code> silently drops the cached data and recomputes it on demand, you get none of the benefit and all of the overhead. We got burned by this once when a larger-than-usual dataset caused silent eviction mid-pipeline. <code>MEMORY_AND_DISK</code> spills the overflow to disk instead of evicting, which is slower than memory but far better than recomputing from scratch.</p></div></div>
<p><strong>Result:</strong> Eliminated one full recomputation of the join + filter stage. <strong>~28 minutes saved.</strong></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="mistake-3-using-the-wrong-join-strategy">Mistake #3: Using the Wrong Join Strategy<a href="https://www.recodehive.com/blog/spark-performance-optimizations#mistake-3-using-the-wrong-join-strategy" class="hash-link" aria-label="Direct link to Mistake #3: Using the Wrong Join Strategy" title="Direct link to Mistake #3: Using the Wrong Join Strategy" translate="no">​</a></h2>
<p><strong>Time lost to this mistake: ~62 minutes</strong></p>
<p>Joins are the most expensive operation in distributed computing. When two datasets need to be joined, Spark has to get rows with matching keys onto the same machine which usually means moving large amounts of data across the network. That network movement is called a shuffle, and it's where most of the time in a join stage actually goes.</p>
<p>PySpark supports three join strategies. Understanding which one to use and when is one of the highest-leverage optimizations available.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-three-strategies">The Three Strategies<a href="https://www.recodehive.com/blog/spark-performance-optimizations#the-three-strategies" class="hash-link" aria-label="Direct link to The Three Strategies" title="Direct link to The Three Strategies" translate="no">​</a></h3>
<p><strong>Sort-Merge Join (default for large tables)</strong>
Both datasets are shuffled so matching keys land on the same partition, then sorted, then merged. Correct for any size. Expensive because of the full shuffle.</p>
<p><strong>Broadcast Join (best for large + small table)</strong>
The smaller table is collected to the driver and sent as a complete copy to every executor. The large table never moves. Dramatically faster when the small table fits comfortably in memory.</p>
<p><strong>Bucket Join (best for repeated joins on the same key)</strong>
Both tables are pre-arranged on disk by join key at write time. When you join two bucketed tables on their bucket key, Spark skips the shuffle entirely, the data is already sitting where it needs to be. Expensive upfront, free on every subsequent join.</p>
<p><img decoding="async" loading="lazy" alt="sort merge joins vs merge join" src="https://www.recodehive.com/assets/images/sort_vs_merge_join-3c0749052cf17608e4f3617cd1df6bdd.png" width="1774" height="887" class="img_ev3q"></p>
<div class="theme-tabs-container tabs-container tabList__CuJ"><ul role="tablist" aria-orientation="horizontal" class="tabs"><li role="tab" tabindex="0" aria-selected="true" class="tabs__item tabItem_LNqP tabs__item--active">Before (Default Sort-Merge)</li><li role="tab" tabindex="-1" aria-selected="false" class="tabs__item tabItem_LNqP">After (Broadcast Join)</li><li role="tab" tabindex="-1" aria-selected="false" class="tabs__item tabItem_LNqP">Bucket Join (Advanced)</li><li role="tab" tabindex="-1" aria-selected="false" class="tabs__item tabItem_LNqP">Strategy Guide</li></ul><div class="margin-top--md"><div role="tabpanel" class="tabItem_Ymn6"><div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockTitle_OeMC">join-before.py</div><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token comment" style="color:#999988;font-style:italic"># Spark defaults to Sort-Merge Join</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># products_df has 50,000 rows — tiny</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># But Spark doesn't know that and shuffles BOTH tables</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># 200M rows of sales_df shuffled across the network</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">sales_df </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> spark</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">read</span><span class="token punctuation" style="color:#393A34">.</span><span class="token builtin">format</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"delta"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">load</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">bronze_path</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">products_df </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> spark</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">read</span><span class="token punctuation" style="color:#393A34">.</span><span class="token builtin">format</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"delta"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">load</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">products_path</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">enriched_df </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> sales_df</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">join</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">products_df</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"product_id"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"left"</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div></div><div role="tabpanel" class="tabItem_Ymn6" hidden=""><div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockTitle_OeMC">join-after.py</div><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> pyspark</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">sql</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">functions </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> broadcast</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">sales_df </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> spark</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">read</span><span class="token punctuation" style="color:#393A34">.</span><span class="token builtin">format</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"delta"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">load</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">bronze_path</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">products_df </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> spark</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">read</span><span class="token punctuation" style="color:#393A34">.</span><span class="token builtin">format</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"delta"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">load</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">products_path</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># Hint tells Spark to broadcast products_df to every executor</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># sales_df (200M rows) is NEVER shuffled</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># products_df (50K rows) is collected once and sent to all nodes</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">enriched_df </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> sales_df</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">join</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">broadcast</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">products_df</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"product_id"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"left"</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div></div><div role="tabpanel" class="tabItem_Ymn6" hidden=""><div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockTitle_OeMC">bucket-join.py</div><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token comment" style="color:#999988;font-style:italic"># Write tables once with bucketing — expensive upfront, free on every future join</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># Use when the same large-to-large join runs repeatedly</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">sales_df</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">write \</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">bucketBy</span><span class="token punctuation" style="color:#393A34">(</span><span class="token number" style="color:#36acaa">64</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"product_id"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> \</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">sortBy</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"product_id"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> \</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">.</span><span class="token builtin">format</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"parquet"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> \</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">saveAsTable</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"sales_bucketed"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">events_df</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">write \</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">bucketBy</span><span class="token punctuation" style="color:#393A34">(</span><span class="token number" style="color:#36acaa">64</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"product_id"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> \</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">sortBy</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"product_id"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> \</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">.</span><span class="token builtin">format</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"parquet"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> \</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">saveAsTable</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"events_bucketed"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># Now this join has ZERO shuffle — data is already co-located</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">result </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> spark</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">table</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"sales_bucketed"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">join</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    spark</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">table</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"events_bucketed"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"product_id"</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div></div><div role="tabpanel" class="tabItem_Ymn6" hidden=""><table><thead><tr><th>Scenario</th><th>Strategy</th><th>Why</th></tr></thead><tbody><tr><td>Large table + small table (&lt; 200MB)</td><td>Broadcast join</td><td>Eliminates shuffle of large table</td></tr><tr><td>Large table + large table, one-time</td><td>Sort-merge (default)</td><td>No alternative without pre-partitioning</td></tr><tr><td>Large table + large table, repeated</td><td>Bucket join</td><td>Pre-pays shuffle cost once, eliminates it forever</td></tr><tr><td>Skewed keys (a few keys have millions of rows)</td><td>Salting + broadcast</td><td>See tip below</td></tr></tbody></table></div></div></div>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>tip</div><div class="admonitionContent_BuS1"><p><strong>Join skew</strong> is a related problem: when a small number of keys have a disproportionate number of rows, all that data lands on one executor which becomes a bottleneck while the rest of the cluster sits idle. The fix is <strong>salting</strong>: add a random integer (0–N) to the skewed key, replicate the smaller table N times with matching salt values, join on the salted key, then drop the salt column. This spreads the skewed key across N executors.</p></div></div>
<p><strong>Result:</strong> Switching the dimension join from sort-merge to broadcast eliminated the largest shuffle in the pipeline. <strong>~62 minutes saved.</strong></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="mistake-4-writing-python-udfs-instead-of-using-built-in-functions">Mistake #4: Writing Python UDFs Instead of Using Built-in Functions<a href="https://www.recodehive.com/blog/spark-performance-optimizations#mistake-4-writing-python-udfs-instead-of-using-built-in-functions" class="hash-link" aria-label="Direct link to Mistake #4: Writing Python UDFs Instead of Using Built-in Functions" title="Direct link to Mistake #4: Writing Python UDFs Instead of Using Built-in Functions" translate="no">​</a></h2>
<p><strong>Time lost to this mistake: ~38 minutes</strong></p>
<p>Python UDFs (User Defined Functions) feel like a natural escape hatch. The built-in Spark functions don't cover what you need, so you write a Python function, decorate it with <code>@udf</code>, and move on. It works. It's just slow in a way that isn't immediately obvious and on a 200-million-row dataset, "not immediately obvious" can mean 38 extra minutes per run.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-udfs-are-expensive">Why UDFs Are Expensive<a href="https://www.recodehive.com/blog/spark-performance-optimizations#why-udfs-are-expensive" class="hash-link" aria-label="Direct link to Why UDFs Are Expensive" title="Direct link to Why UDFs Are Expensive" translate="no">​</a></h3>
<p>Here's what's actually happening when a Python UDF runs on a Spark cluster: PySpark lives on the JVM, and Python lives in a completely separate process. Every single row your UDF touches has to be packaged up, handed across a process boundary into the Python runtime, processed, and then packaged back up and handed back to the JVM. It's the equivalent of passing every item from a warehouse to a worker standing outside the building through a narrow window, one item at a time, both ways.</p>
<p>We had three UDFs doing string cleaning on a 200-million-row DataFrame. Each UDF triggered that full cross-process handoff 200 million times. The functions themselves were trivial, a regex and some string lowercasing. The cost wasn't in the logic, it was in the 600 million window-handoffs happening around it.</p>
<p>Built-in Spark functions (<code>pyspark.sql.functions</code>) don't have this problem. They run entirely inside the JVM alongside Spark's own engine, with no process boundary to cross and no per-row packaging overhead.</p>
<p><img decoding="async" loading="lazy" alt="python user define functions" src="https://www.recodehive.com/assets/images/python_udf_vs_built_in-6f2b08c1789ea99e8ee6a55099f292b4.png" width="1774" height="887" class="img_ev3q"></p>
<div class="theme-tabs-container tabs-container tabList__CuJ"><ul role="tablist" aria-orientation="horizontal" class="tabs"><li role="tab" tabindex="0" aria-selected="true" class="tabs__item tabItem_LNqP tabs__item--active">Before (Python UDF)</li><li role="tab" tabindex="-1" aria-selected="false" class="tabs__item tabItem_LNqP">After (Built-in Functions)</li><li role="tab" tabindex="-1" aria-selected="false" class="tabs__item tabItem_LNqP">When UDFs Are Unavoidable</li></ul><div class="margin-top--md"><div role="tabpanel" class="tabItem_Ymn6"><div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockTitle_OeMC">udf-before.py</div><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> pyspark</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">sql</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">functions </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> udf</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> pyspark</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">sql</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">types </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> StringType</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> re</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># Registered as a Python UDF</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># For 200M rows: cross-process handoff happens 200M times per UDF</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token decorator annotation punctuation" style="color:#393A34">@udf</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">returnType</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">StringType</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">clean_phone</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">phone</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> phone </span><span class="token keyword" style="color:#00009f">is</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">return</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    digits </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> re</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">sub</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">r"\D"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">""</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> phone</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">return</span><span class="token plain"> digits </span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> </span><span class="token builtin">len</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">digits</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">==</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">10</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">else</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token decorator annotation punctuation" style="color:#393A34">@udf</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">returnType</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">StringType</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">normalize_category</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">cat</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> cat </span><span class="token keyword" style="color:#00009f">is</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">return</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"unknown"</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">return</span><span class="token plain"> cat</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">strip</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">lower</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">replace</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">" "</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"_"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">df </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    df</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">withColumn</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"phone_clean"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> clean_phone</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">col</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"phone"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">withColumn</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"category_norm"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> normalize_category</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">col</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"category"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div></div><div role="tabpanel" class="tabItem_Ymn6" hidden=""><div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockTitle_OeMC">udf-after.py</div><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> pyspark</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">sql</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">functions </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    regexp_replace</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> when</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> length</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> trim</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> lower</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> col</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># All native JVM execution — no cross-process overhead at all</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">df </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    df</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Strip non-digits from phone</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">withColumn</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"phone_digits"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> regexp_replace</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">col</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"phone"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">r"\D"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">""</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Keep only 10-digit numbers, null otherwise</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">withColumn</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">"phone_clean"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        when</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">length</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">col</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"phone_digits"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">==</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">10</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> col</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"phone_digits"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">otherwise</span><span class="token punctuation" style="color:#393A34">(</span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Normalize category: trim, lowercase, replace spaces</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">withColumn</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">"category_norm"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        when</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">col</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"category"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">isNull</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"unknown"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">otherwise</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            regexp_replace</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">lower</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">trim</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">col</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"category"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">" "</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"_"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">drop</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"phone_digits"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div></div><div role="tabpanel" class="tabItem_Ymn6" hidden=""><div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockTitle_OeMC">pandas-udf.py</div><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token comment" style="color:#999988;font-style:italic"># If no built-in equivalent exists, use a Pandas UDF (vectorized)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># Pandas UDFs process data in Arrow batches, not row-by-row</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># Still crosses the process boundary, but once per batch instead of once per row</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> pyspark</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">sql</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">functions </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> pandas_udf</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> pyspark</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">sql</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">types </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> StringType</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> pandas </span><span class="token keyword" style="color:#00009f">as</span><span class="token plain"> pd</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token decorator annotation punctuation" style="color:#393A34">@pandas_udf</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">StringType</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">complex_transform</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">series</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> pd</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">Series</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> pd</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">Series</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># This runs on batches of rows, not individual rows</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token comment" style="color:#999988;font-style:italic"># Use only when no built-in function covers your logic</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">return</span><span class="token plain"> series</span><span class="token punctuation" style="color:#393A34">.</span><span class="token builtin">apply</span><span class="token punctuation" style="color:#393A34">(</span><span class="token keyword" style="color:#00009f">lambda</span><span class="token plain"> x</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> your_complex_logic</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">x</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> x </span><span class="token keyword" style="color:#00009f">else</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">None</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">df </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> df</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">withColumn</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"result"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> complex_transform</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">col</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"input_col"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div></div></div></div>
<div class="theme-admonition theme-admonition-note admonition_xJq3 alert alert--secondary"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M6.3 5.69a.942.942 0 0 1-.28-.7c0-.28.09-.52.28-.7.19-.18.42-.28.7-.28.28 0 .52.09.7.28.18.19.28.42.28.7 0 .28-.09.52-.28.7a1 1 0 0 1-.7.3c-.28 0-.52-.11-.7-.3zM8 7.99c-.02-.25-.11-.48-.31-.69-.2-.19-.42-.3-.69-.31H6c-.27.02-.48.13-.69.31-.2.2-.3.44-.31.69h1v3c.02.27.11.5.31.69.2.2.42.31.69.31h1c.27 0 .48-.11.69-.31.2-.19.3-.42.31-.69H8V7.98v.01zM7 2.3c-3.14 0-5.7 2.54-5.7 5.68 0 3.14 2.56 5.7 5.7 5.7s5.7-2.55 5.7-5.7c0-3.15-2.56-5.69-5.7-5.69v.01zM7 .98c3.86 0 7 3.14 7 7s-3.14 7-7 7-7-3.12-7-7 3.14-7 7-7z"></path></svg></span>note</div><div class="admonitionContent_BuS1"><p>The decision tree we follow for function choice:</p><ol>
<li class="">Does a <code>pyspark.sql.functions</code> built-in exist? → <strong>Use it.</strong></li>
<li class="">Does the logic involve complex Python libraries (ML models, regex with lookbehind, etc.)? → <strong>Use a Pandas UDF.</strong></li>
<li class="">Is there truly no alternative? → <strong>Use a Python UDF, and leave a comment explaining why.</strong></li>
</ol><p>The vast majority of string cleaning, type casting, null handling, and conditional logic is covered by built-in functions. Check the <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/functions.html" target="_blank" rel="noopener noreferrer" class="">PySpark function docs</a> before reaching for <code>@udf</code> — it takes 5 minutes and has saved us hours.</p></div></div>
<p><strong>Result:</strong> Replaced three Python UDFs with built-in equivalents. Stage runtime dropped from 41 minutes to 3 minutes. <strong>~38 minutes saved.</strong></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="mistake-5-reading-more-data-than-necessary">Mistake #5: Reading More Data Than Necessary<a href="https://www.recodehive.com/blog/spark-performance-optimizations#mistake-5-reading-more-data-than-necessary" class="hash-link" aria-label="Direct link to Mistake #5: Reading More Data Than Necessary" title="Direct link to Mistake #5: Reading More Data Than Necessary" translate="no">​</a></h2>
<p><strong>Time lost to this mistake: ~44 minutes</strong></p>
<p>Before any transformation runs, the data has to come off storage and into Spark's memory. If you pull 180GB when you only need 2GB, you've already lost, no amount of smart transformation logic downstream recovers those wasted read operations.</p>
<p>Two mechanisms cut data at the source: <strong>predicate pushdown</strong> and <strong>column pruning</strong>. Both work with Parquet and Delta Lake natively. Both get silently deactivated by small, easy-to-miss coding patterns.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="predicate-pushdown">Predicate Pushdown<a href="https://www.recodehive.com/blog/spark-performance-optimizations#predicate-pushdown" class="hash-link" aria-label="Direct link to Predicate Pushdown" title="Direct link to Predicate Pushdown" translate="no">​</a></h3>
<p>Imagine your Delta table as a library where each day's data lives in its own room, with the date on the door. Partition pruning is walking straight to yesterday's room. What we were doing instead was opening every room in the library, pulling every book off every shelf, carrying it all to a reading table, and then only reading the ones with yesterday's date on the spine before putting everything else back. The library was organized correctly. We just weren't reading the signs on the doors.</p>
<p>With 90 days of history accumulated, we were reading 90x more data than the job actually needed on every single run. The fix is pushing the date filter into the read itself, so Spark can use the partition directory structure to skip everything irrelevant before a single file is opened.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="column-pruning">Column Pruning<a href="https://www.recodehive.com/blog/spark-performance-optimizations#column-pruning" class="hash-link" aria-label="Direct link to Column Pruning" title="Direct link to Column Pruning" translate="no">​</a></h3>
<p>Parquet stores data column by column, not row by row. This means if your table has 40 columns but your transformation uses 6, you can tell Spark to only load those 6 columns' physical data from disk. The other 34 are never touched. The catch: you have to select those columns at read time, not after a chain of transformations.</p>
<div class="theme-tabs-container tabs-container tabList__CuJ"><ul role="tablist" aria-orientation="horizontal" class="tabs"><li role="tab" tabindex="0" aria-selected="true" class="tabs__item tabItem_LNqP tabs__item--active">Before (Full Scan)</li><li role="tab" tabindex="-1" aria-selected="false" class="tabs__item tabItem_LNqP">After (Pushdown + Pruning)</li><li role="tab" tabindex="-1" aria-selected="false" class="tabs__item tabItem_LNqP">Verify It's Working</li></ul><div class="margin-top--md"><div role="tabpanel" class="tabItem_Ymn6"><div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockTitle_OeMC">read-before.py</div><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> datetime </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> datetime</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> timedelta</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">yesterday </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">datetime</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">now</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token plain"> timedelta</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">days</span><span class="token operator" style="color:#393A34">=</span><span class="token number" style="color:#36acaa">1</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">strftime</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"%Y-%m-%d"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># Reads ALL columns from ALL partitions</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># Then filters in memory — after all 180GB is already read</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">bronze_df </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> spark</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">read</span><span class="token punctuation" style="color:#393A34">.</span><span class="token builtin">format</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"delta"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">load</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">bronze_path</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># Filter applied AFTER load — no partition pruning, no column pruning</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">filtered_df </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    bronze_df</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">.</span><span class="token builtin">filter</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">col</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"event_date"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">==</span><span class="token plain"> yesterday</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">.</span><span class="token builtin">filter</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">col</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"status"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">==</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"completed"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># Selecting columns here is too late — data already read into memory</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">result_df </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> filtered_df</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">select</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"event_id"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"product_id"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"revenue"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"event_date"</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div></div><div role="tabpanel" class="tabItem_Ymn6" hidden=""><div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockTitle_OeMC">read-after.py</div><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> datetime </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> datetime</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> timedelta</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> pyspark</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">sql</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">functions </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> col</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">yesterday </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">datetime</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">now</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token plain"> timedelta</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">days</span><span class="token operator" style="color:#393A34">=</span><span class="token number" style="color:#36acaa">1</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">strftime</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"%Y-%m-%d"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># Column pruning: Spark reads ONLY these columns from Parquet files</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># Predicate pushdown: partition filter applied at file-reader level</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># Spark skips all partitions where event_date != yesterday</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">bronze_df </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    spark</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">read</span><span class="token punctuation" style="color:#393A34">.</span><span class="token builtin">format</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"delta"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">load</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">bronze_path</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">select</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"event_id"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"product_id"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"revenue"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"event_date"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"status"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain">  </span><span class="token comment" style="color:#999988;font-style:italic"># Prune columns first</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">.</span><span class="token builtin">filter</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">col</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"event_date"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">==</span><span class="token plain"> yesterday</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain">      </span><span class="token comment" style="color:#999988;font-style:italic"># Partition pruning — activates at read time</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">.</span><span class="token builtin">filter</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">col</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"status"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">==</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"completed"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># Predicate pushdown into Parquet row groups</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div></div><div role="tabpanel" class="tabItem_Ymn6" hidden=""><div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockTitle_OeMC">verify-pushdown.py</div><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token comment" style="color:#999988;font-style:italic"># Confirm predicate pushdown is active — check the physical plan</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">bronze_df</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">explain</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">mode</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">"extended"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># In the output, look for:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># PartitionFilters: [isnotnull(event_date#12), (event_date#12 = 2026-05-15)]</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># PushedFilters: [IsNotNull(status), EqualTo(status,completed)]</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic">#</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># If you see PartitionFilters: []  →  partition pruning is NOT active</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># If you see PushedFilters: []     →  predicate pushdown is NOT active</span><br></div></code></pre></div></div></div></div></div>
<div class="theme-admonition theme-admonition-note admonition_xJq3 alert alert--secondary"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M6.3 5.69a.942.942 0 0 1-.28-.7c0-.28.09-.52.28-.7.19-.18.42-.28.7-.28.28 0 .52.09.7.28.18.19.28.42.28.7 0 .28-.09.52-.28.7a1 1 0 0 1-.7.3c-.28 0-.52-.11-.7-.3zM8 7.99c-.02-.25-.11-.48-.31-.69-.2-.19-.42-.3-.69-.31H6c-.27.02-.48.13-.69.31-.2.2-.3.44-.31.69h1v3c.02.27.11.5.31.69.2.2.42.31.69.31h1c.27 0 .48-.11.69-.31.2-.19.3-.42.31-.69H8V7.98v.01zM7 2.3c-3.14 0-5.7 2.54-5.7 5.68 0 3.14 2.56 5.7 5.7 5.7s5.7-2.55 5.7-5.7c0-3.15-2.56-5.69-5.7-5.69v.01zM7 .98c3.86 0 7 3.14 7 7s-3.14 7-7 7-7-3.12-7-7 3.14-7 7-7z"></path></svg></span>note</div><div class="admonitionContent_BuS1"><p>Two things both have to be true for partition pruning to work. First, the table must have been written with <code>partitionBy</code> on the column you're filtering. Second and this is the one that catches people, the filter must be on the partition column <em>as it exists in the table</em>, not on a renamed or derived version. We once spent an hour debugging a full scan that turned out to be caused by a <code>withColumnRenamed("event_date", "date")</code> sitting one line before the filter. The column name changed, Spark couldn't match it to the partition metadata, and pruning silently fell back to a full scan.</p></div></div>
<p><strong>Result:</strong> Data read dropped from ~180GB to ~2GB. Read + deserialization time fell from 47 minutes to 3 minutes. <strong>~44 minutes saved.</strong></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="mistake-6-default-cluster-configuration">Mistake #6: Default Cluster Configuration<a href="https://www.recodehive.com/blog/spark-performance-optimizations#mistake-6-default-cluster-configuration" class="hash-link" aria-label="Direct link to Mistake #6: Default Cluster Configuration" title="Direct link to Mistake #6: Default Cluster Configuration" translate="no">​</a></h2>
<p><strong>Time lost to this mistake: ~35 minutes (idle and wasted compute)</strong></p>
<p>Even with perfect code, a misconfigured cluster leaves compute sitting idle. These settings control how many tasks run in parallel, how much memory each task gets, and whether the cluster actually uses all the hardware you're paying for.</p>
<p>Beginners typically either accept the cloud provider's defaults without question, or paste settings from a Stack Overflow answer written for a different dataset and cluster size. Neither approach reflects the actual workload.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-key-settings-and-what-they-do">The Key Settings and What They Do<a href="https://www.recodehive.com/blog/spark-performance-optimizations#the-key-settings-and-what-they-do" class="hash-link" aria-label="Direct link to The Key Settings and What They Do" title="Direct link to The Key Settings and What They Do" translate="no">​</a></h3>
<p><strong><code>spark.executor.memory</code></strong> - how much RAM each executor process gets. Too little and tasks start writing intermediate data to disk, which is dramatically slower. Too much and you've allocated headroom the executor can't use, while also giving the JVM garbage collector more memory to scan on every GC cycle.</p>
<p><strong><code>spark.executor.cores</code></strong> - how many tasks an executor runs simultaneously. We settled on 5 after testing: below 4, the executor's memory sits underutilized because there aren't enough concurrent tasks to fill it. Above 5, we started seeing storage I/O contention — too many tasks competing to read from the same disks at once. Five was the sweet spot for our setup, and it matches what we've seen hold up across different cluster sizes.</p>
<p><strong><code>spark.executor.instances</code></strong> - total number of executors. With autoscale on, this becomes a min/max bound rather than a fixed count.</p>
<p><strong><code>spark.driver.memory</code></strong> - the driver collects broadcast tables before distributing them to executors, so it needs more headroom than the default 1g allows. We had broadcast joins failing silently and falling back to sort-merge before we realized the driver was OOM-ing on the collection step.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="right-sizing-for-our-8gb-pipeline">Right-Sizing for Our 8GB Pipeline<a href="https://www.recodehive.com/blog/spark-performance-optimizations#right-sizing-for-our-8gb-pipeline" class="hash-link" aria-label="Direct link to Right-Sizing for Our 8GB Pipeline" title="Direct link to Right-Sizing for Our 8GB Pipeline" translate="no">​</a></h3>
<p>Our cluster: 4 worker nodes, each with 16 cores and 64GB RAM.</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">Available per node after OS overhead (~7GB): 57GB RAM, 15 cores</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">Executor cores: 5 (our tested sweet spot)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">Executors per node: 15 ÷ 5 = 3 executors per node</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">Memory per executor: 57GB ÷ 3 = 19GB (leave ~1GB headroom → set 18GB)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">Total executors: 3 × 4 nodes = 12 executors</span><br></div></code></pre></div></div>
<div class="theme-tabs-container tabs-container tabList__CuJ"><ul role="tablist" aria-orientation="horizontal" class="tabs"><li role="tab" tabindex="0" aria-selected="true" class="tabs__item tabItem_LNqP tabs__item--active">Before (Defaults)</li><li role="tab" tabindex="-1" aria-selected="false" class="tabs__item tabItem_LNqP">After (Right-Sized)</li><li role="tab" tabindex="-1" aria-selected="false" class="tabs__item tabItem_LNqP">Config Reference</li></ul><div class="margin-top--md"><div role="tabpanel" class="tabItem_Ymn6"><div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockTitle_OeMC">cluster-default.py</div><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token comment" style="color:#999988;font-style:italic"># Default Spark config — unchanged from cluster creation</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># On our 4-node cluster, these settings leave most resources unused</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">spark </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> SparkSession</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">builder \</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">appName</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"SalesPipeline"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> \</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">getOrCreate</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># Defaults that hurt us:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># spark.executor.memory    = 1g   (way too small — spills to disk constantly)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># spark.executor.cores     = 1    (only 1 task per executor — 15 cores idle per node)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># spark.executor.instances = 2    (2 executors on a 4-node cluster — 50% idle)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># spark.driver.memory      = 1g   (broadcast joins silently fall back to sort-merge)</span><br></div></code></pre></div></div></div><div role="tabpanel" class="tabItem_Ymn6" hidden=""><div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockTitle_OeMC">cluster-tuned.py</div><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">spark </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> SparkSession</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">builder \</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">appName</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"SalesPipeline"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> \</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">config</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"spark.executor.memory"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"18g"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> \</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">config</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"spark.executor.cores"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"5"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> \</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">config</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"spark.executor.instances"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"12"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> \</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">config</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"spark.driver.memory"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"8g"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> \</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">config</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"spark.sql.shuffle.partitions"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"64"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> \</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">config</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"spark.sql.adaptive.enabled"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"true"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> \</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">config</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"spark.dynamicAllocation.enabled"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"true"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> \</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">config</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"spark.dynamicAllocation.minExecutors"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"2"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> \</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">config</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"spark.dynamicAllocation.maxExecutors"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"12"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> \</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">getOrCreate</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div></div><div role="tabpanel" class="tabItem_Ymn6" hidden=""><table><thead><tr><th>Setting</th><th>Default</th><th>Our Value</th><th>Rule of Thumb</th></tr></thead><tbody><tr><td><code>spark.executor.memory</code></td><td>1g</td><td>18g</td><td>(Node RAM − OS overhead) ÷ executors per node</td></tr><tr><td><code>spark.executor.cores</code></td><td>1</td><td>5</td><td>4–5 per executor (test on your setup)</td></tr><tr><td><code>spark.executor.instances</code></td><td>2</td><td>12</td><td>(cores per node ÷ executor cores) × node count</td></tr><tr><td><code>spark.driver.memory</code></td><td>1g</td><td>8g</td><td>4–8g; higher if using large broadcasts</td></tr><tr><td><code>spark.sql.shuffle.partitions</code></td><td>200</td><td>64</td><td>Total shuffle data size ÷ 128MB</td></tr></tbody></table></div></div></div>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>tip</div><div class="admonitionContent_BuS1"><p>For cloud clusters (Databricks, EMR, Dataproc), enable <strong>dynamic allocation</strong> instead of a fixed executor count. Dynamic allocation releases executors back to the pool during idle stages and acquires more when tasks are queuing — so a 3-minute light stage doesn't hold 12 executors that other jobs could use.</p><div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">spark</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">conf</span><span class="token punctuation" style="color:#393A34">.</span><span class="token builtin">set</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"spark.dynamicAllocation.enabled"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"true"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">spark</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">conf</span><span class="token punctuation" style="color:#393A34">.</span><span class="token builtin">set</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"spark.dynamicAllocation.minExecutors"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"2"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">spark</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">conf</span><span class="token punctuation" style="color:#393A34">.</span><span class="token builtin">set</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"spark.dynamicAllocation.maxExecutors"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"12"</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div></div></div>
<p><strong>Result:</strong> Fully utilizing all 4 nodes reduced total wall-clock time by eliminating idle compute. Combined with eliminating disk spills from under-provisioned executors: <strong>~35 minutes saved.</strong></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="before-and-after-summary">Before and After Summary<a href="https://www.recodehive.com/blog/spark-performance-optimizations#before-and-after-summary" class="hash-link" aria-label="Direct link to Before and After Summary" title="Direct link to Before and After Summary" translate="no">​</a></h2>
<div class="theme-admonition theme-admonition-info admonition_xJq3 alert alert--info"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M7 2.3c3.14 0 5.7 2.56 5.7 5.7s-2.56 5.7-5.7 5.7A5.71 5.71 0 0 1 1.3 8c0-3.14 2.56-5.7 5.7-5.7zM7 1C3.14 1 0 4.14 0 8s3.14 7 7 7 7-3.14 7-7-3.14-7-7-7zm1 3H6v5h2V4zm0 6H6v2h2v-2z"></path></svg></span>info</div><div class="admonitionContent_BuS1"><table><thead><tr><th>Mistake</th><th>Root Cause</th><th>Time Before</th><th>Time After</th><th>Saved</th></tr></thead><tbody><tr><td>Wrong shuffle partition count</td><td>Default 200 partitions for 8GB dataset</td><td>48 min</td><td>11 min</td><td><strong>37 min</strong></td></tr><tr><td>No caching on reused DataFrame</td><td>base_df computed twice from scratch</td><td>28 min</td><td>&lt;1 min</td><td><strong>28 min</strong></td></tr><tr><td>Sort-merge join on dimension table</td><td>50K-row table shuffled like a large table</td><td>65 min</td><td>3 min</td><td><strong>62 min</strong></td></tr><tr><td>Python UDFs for string operations</td><td>Per-row cross-process overhead</td><td>41 min</td><td>3 min</td><td><strong>38 min</strong></td></tr><tr><td>Full table scan on partitioned table</td><td>Filter applied after read, not at read time</td><td>47 min</td><td>3 min</td><td><strong>44 min</strong></td></tr><tr><td>Default cluster config (1 core/executor)</td><td>15 cores idle per node, constant disk spill</td><td>45 min</td><td>10 min</td><td><strong>35 min</strong></td></tr><tr><td><strong>Total</strong></td><td></td><td><strong>4h 12min</strong></td><td><strong>34 min</strong></td><td><strong>~3h 38min</strong></td></tr></tbody></table></div></div>
<p>From 4 hours 12 minutes down to 34 minutes — an <strong>86% reduction</strong> on a pipeline doing exactly the same computation on exactly the same data.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="pyspark-optimization-checklist">PySpark Optimization Checklist<a href="https://www.recodehive.com/blog/spark-performance-optimizations#pyspark-optimization-checklist" class="hash-link" aria-label="Direct link to PySpark Optimization Checklist" title="Direct link to PySpark Optimization Checklist" translate="no">​</a></h2>
<p>Run through this before every pipeline goes to production.</p>
<p><strong>Shuffle &amp; Partitions</strong></p>
<ul class="contains-task-list containsTaskList_mC6p">
<li class="task-list-item"><input type="checkbox" disabled=""> <!-- -->Is <code>spark.sql.shuffle.partitions</code> set based on actual post-shuffle data size, not the default 200?</li>
<li class="task-list-item"><input type="checkbox" disabled=""> <!-- -->Is Adaptive Query Execution (<code>spark.sql.adaptive.enabled</code>) turned on?</li>
<li class="task-list-item"><input type="checkbox" disabled=""> <!-- -->Are there stages with a very large or very small number of tasks compared to the cluster size?</li>
</ul>
<p><strong>Caching</strong></p>
<ul class="contains-task-list containsTaskList_mC6p">
<li class="task-list-item"><input type="checkbox" disabled=""> <!-- -->Is any DataFrame referenced more than once? If yes — is it cached before the first action?</li>
<li class="task-list-item"><input type="checkbox" disabled=""> <!-- -->Is <code>.unpersist()</code> called after the cached DataFrame is no longer needed?</li>
<li class="task-list-item"><input type="checkbox" disabled=""> <!-- -->Is <code>StorageLevel.MEMORY_AND_DISK</code> used instead of <code>MEMORY_ONLY</code>?</li>
</ul>
<p><strong>Joins</strong></p>
<ul class="contains-task-list containsTaskList_mC6p">
<li class="task-list-item"><input type="checkbox" disabled=""> <!-- -->Is every join between a large and small table using <code>broadcast()</code>?</li>
<li class="task-list-item"><input type="checkbox" disabled=""> <!-- -->Is any large-to-large join repeated on the same key? If yes — is bucketing being used?</li>
<li class="task-list-item"><input type="checkbox" disabled=""> <!-- -->Are there any skewed keys? Check the Spark UI for tasks with 10x–100x longer runtimes than others in the same stage.</li>
</ul>
<p><strong>Functions &amp; UDFs</strong></p>
<ul class="contains-task-list containsTaskList_mC6p">
<li class="task-list-item"><input type="checkbox" disabled=""> <!-- -->Is every Python UDF replaceable with a <code>pyspark.sql.functions</code> built-in?</li>
<li class="task-list-item"><input type="checkbox" disabled=""> <!-- -->If a UDF is unavoidable, is it a Pandas UDF (vectorized) rather than a row-by-row Python UDF?</li>
</ul>
<p><strong>Reading Data</strong></p>
<ul class="contains-task-list containsTaskList_mC6p">
<li class="task-list-item"><input type="checkbox" disabled=""> <!-- -->Are only needed columns selected at read time (not <code>select *</code> after transformation)?</li>
<li class="task-list-item"><input type="checkbox" disabled=""> <!-- -->Is the partition filter applied immediately on the read result, on the partition column itself?</li>
<li class="task-list-item"><input type="checkbox" disabled=""> <!-- -->Does <code>df.explain()</code> show <code>PartitionFilters</code> and <code>PushedFilters</code> as non-empty?</li>
</ul>
<p><strong>Cluster Configuration</strong></p>
<ul class="contains-task-list containsTaskList_mC6p">
<li class="task-list-item"><input type="checkbox" disabled=""> <!-- -->Is <code>spark.executor.cores</code> set to 4–5 (not the default of 1)?</li>
<li class="task-list-item"><input type="checkbox" disabled=""> <!-- -->Is <code>spark.executor.memory</code> calculated from actual node RAM, not left at the 1g default?</li>
<li class="task-list-item"><input type="checkbox" disabled=""> <!-- -->Is dynamic allocation enabled for variable-length workloads?</li>
<li class="task-list-item"><input type="checkbox" disabled=""> <!-- -->Is <code>spark.driver.memory</code> set high enough to handle broadcast tables without OOM?</li>
</ul>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="key-lessons">Key Lessons<a href="https://www.recodehive.com/blog/spark-performance-optimizations#key-lessons" class="hash-link" aria-label="Direct link to Key Lessons" title="Direct link to Key Lessons" translate="no">​</a></h2>
<p><strong>The Spark UI is your fastest debugging tool.</strong> Every mistake above shows up in the Spark UI before you ever look at the code: long stage durations from wrong partition counts, skewed task distribution from join issues, tiny data sizes per task from over-partitioning, zero partition filters from missed pushdown. Open the UI first, read the physical plan second, look at the code third.</p>
<p><strong>PySpark never stops you from writing the slow version.</strong> The job runs either way. The only difference is whether it finishes in 34 minutes or 4 hours. Spark assumes you know what you're doing — which means the performance consequences of defaults are entirely invisible until you look for them.</p>
<p><strong>Built-in functions exist for almost everything.</strong> The instinct to reach for a Python UDF is understandable — Python is what most data engineers know best. But the <code>pyspark.sql.functions</code> module covers an enormous surface area: string manipulation, date arithmetic, array operations, conditional logic, window functions. A 5-minute search through the docs is almost always faster than the performance penalty of writing and maintaining a UDF.</p>
<p><strong>Optimization compounds.</strong> None of the six fixes above is independent. Fixing the partition count makes the join faster. Fixing the join makes caching more effective. Fixing the read makes everything upstream cheaper. Start with the fix that addresses the largest stage duration in the Spark UI and work down from there.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="frequently-asked-questions">Frequently Asked Questions<a href="https://www.recodehive.com/blog/spark-performance-optimizations#frequently-asked-questions" class="hash-link" aria-label="Direct link to Frequently Asked Questions" title="Direct link to Frequently Asked Questions" translate="no">​</a></h2>
<p><strong>Q: How do I know if my DataFrame is actually being cached or if Spark is silently dropping it?</strong>
A: The Storage tab in the Spark UI is the fastest way to check. Cached DataFrames show up there with their storage level, what fraction of the data was actually stored, and how much memory it consumed. If nothing shows up after an action runs, it either means the cache hasn't been triggered yet — caching is lazy, so it only materializes on the first action — or Spark evicted it because executor memory filled up. We switched everything to <code>MEMORY_AND_DISK</code> after getting burned by a silent eviction that caused a job to recompute a 20-minute stage we thought was cached. Under that storage level, Spark spills to disk instead of evicting, so you at least keep the result.</p>
<p><strong>Q: Is the broadcast join threshold configurable? What if my "small" table is 300MB?</strong>
A: Yes, Spark's default auto-broadcast threshold is 10MB, which is conservative. We've raised it to 300MB on tables we know are stable in size:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">spark</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">conf</span><span class="token punctuation" style="color:#393A34">.</span><span class="token builtin">set</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"spark.sql.autoBroadcastJoinThreshold"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token builtin">str</span><span class="token punctuation" style="color:#393A34">(</span><span class="token number" style="color:#36acaa">300</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">*</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">1024</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">*</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">1024</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
<p>One thing we learned: don't go above 500MB without testing carefully. The driver has to collect the entire table into memory before broadcasting it out, and if you push that too high you'll see the driver OOM before the broadcast even starts and the error message isn't always obvious about what caused it.</p>
<p><strong>Q: Should I always use <code>spark.sql.adaptive.enabled</code>? Are there downsides?</strong>
A: We run it on everything now and haven't regretted it. AQE has genuinely saved us from bad shuffle partition counts more than once, particularly on pipelines where the data volume varies day to day and our static estimate was off. The one scenario where we saw it cause slowdowns was on a particularly complex query plan with 20+ joins, where AQE's planning overhead added more time than the optimization saved. We turned it off for that specific job and kept it on everywhere else.</p>
<p><strong>Q: How do I find join skew in the Spark UI?</strong>
A: Go to the Stages tab and look for a stage where the Max task duration is dramatically higher than the Median, anything above a 5x ratio is worth investigating. We had a stage once where the median task took 3 seconds and one task took 47 minutes. That's the classic skew signature: one executor holding a massive key while the rest of the cluster finishes and sits idle. Click into the stage, look at the task duration histogram, and if you see one bar far to the right while everything else clusters near zero, you've found it.</p>
<p><strong>Q: What's the difference between <code>.cache()</code> and <code>.persist()</code>?</strong>
A: In practice, we always use <code>.persist(StorageLevel.MEMORY_AND_DISK)</code> explicitly and skip <code>.cache()</code> entirely. The behavior of <code>.cache()</code> has changed across Spark versions, in some versions it defaults to <code>MEMORY_ONLY</code>, in others <code>MEMORY_AND_DISK</code>. Rather than remember which version does what, we just use the explicit form. It takes four more characters to type and removes all ambiguity.</p>
<p><strong>Q: Can I over-partition? Is more shuffle partitions always safer?</strong>
A: Yes, over-partitioning is a real problem and we've hit it. We had a pipeline where someone had set shuffle partitions to 1000 "to be safe" on a 4GB dataset. The Spark UI showed 1000 tasks completing in under 100ms each, the entire stage was task scheduling overhead, not computation. Spark's scheduler has to launch, track, and retire each task individually, and at 1000 tasks on 4GB of data, that bookkeeping cost more than the actual work. If you see a stage in the UI where every task completes in milliseconds, that's the sign you're over-partitioned. Drop the count by 4x and re-run.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="references-and-further-reading">References and Further Reading<a href="https://www.recodehive.com/blog/spark-performance-optimizations#references-and-further-reading" class="hash-link" aria-label="Direct link to References and Further Reading" title="Direct link to References and Further Reading" translate="no">​</a></h2>
<ul>
<li class=""><a href="https://spark.apache.org/docs/latest/sql-performance-tuning.html" target="_blank" rel="noopener noreferrer" class="">Apache Spark - Performance Tuning Guide</a></li>
<li class=""><a href="https://spark.apache.org/docs/latest/sql-performance-tuning.html#adaptive-query-execution" target="_blank" rel="noopener noreferrer" class="">Apache Spark - Adaptive Query Execution</a></li>
<li class=""><a href="https://docs.delta.io/latest/optimizations-oss.html" target="_blank" rel="noopener noreferrer" class="">Delta Lake - Optimizations and Best Practices</a></li>
<li class=""><a href="https://learn.microsoft.com/en-us/azure/databricks/transform/optimize-joins" target="_blank" rel="noopener noreferrer" class="">Databricks - Optimize PySpark Joins</a></li>
<li class=""><a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/functions.html" target="_blank" rel="noopener noreferrer" class="">PySpark SQL Functions Reference</a></li>
<li class=""><a href="https://www.recodehive.com/blog/azure-cost-optimization" target="_blank" rel="noopener noreferrer" class="">RecodeHive - Azure Data Pipeline Cost Optimization</a></li>
<li class=""><a href="https://www.recodehive.com/blog/medallion-architecture" target="_blank" rel="noopener noreferrer" class="">RecodeHive - Medallion Architecture Explained</a></li>
<li class=""><a href="https://www.recodehive.com/blog/azure-cost-optimization" target="_blank" rel="noopener noreferrer" class="">RecodeHive - Hidden Cost of Streaming Pipelines</a></li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="about-the-author">About the Author<a href="https://www.recodehive.com/blog/spark-performance-optimizations#about-the-author" class="hash-link" aria-label="Direct link to About the Author" title="Direct link to About the Author" translate="no">​</a></h2>
<p><strong>Aditya Singh Rathore</strong> is a Data Engineer focused on building modern, scalable data platforms on Azure and Databricks. He writes about data engineering, cloud architecture, and real-world pipelines on <a href="https://www.recodehive.com/" target="_blank" rel="noopener noreferrer" class="">RecodeHive</a> turning hard-won production lessons into content anyone can apply.</p>
<p>🔗 <a href="https://www.linkedin.com/in/aditya-singh-rathore0017/" target="_blank" rel="noopener noreferrer" class="">LinkedIn</a> | <a href="https://github.com/Adez017" target="_blank" rel="noopener noreferrer" class="">GitHub</a></p>
<div></div>]]></content:encoded>
            <author>rathoreadityasingh30@gmail.com (Aditya Singh Rathore)</author>
            <category>pyspark</category>
            <category>spark</category>
            <category>optimization</category>
            <category>data-engineering</category>
            <category>delta-lake</category>
            <category>performance</category>
            <category>partitioning</category>
            <category>joins</category>
            <category>caching</category>
            <category>cluster-config</category>
        </item>
        <item>
            <title><![CDATA[Azure Data Pipeline Cost Optimization: How We Cut a $4,200 Bill by 73%]]></title>
            <link>https://www.recodehive.com/blog/azure-cost-optimization</link>
            <guid>https://www.recodehive.com/blog/azure-cost-optimization</guid>
            <pubDate>Fri, 15 May 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Six costly mistakes in a 2GB Azure data pipeline drove a $4,200 monthly bill. Here's every mistake, root cause, and fix — with before-and-after numbers for each.]]></description>
            <content:encoded><![CDATA[<p>The Azure billing email arrived on the first of the month. <strong>$4,247.83.</strong></p>
<p>Our pipeline processed roughly 2GB of sales data per day and served a Power BI dashboard to 30 users. There was no logical reason for a bill that size. Over the next three days, a line-by-line audit of Azure Cost Analysis revealed not one big mistake but <strong>six medium-sized ones</strong>, each silently running up costs in parallel, invisible until the invoice arrived.</p>
<p>This post is that investigation: every mistake explained, every fix documented, and the exact before-and-after numbers. If you're building data pipelines on Azure and haven't audited your costs recently, at least one of these is probably happening to you right now.</p>
<p><strong>What you'll learn in this post:</strong></p>
<ul>
<li class="">Why a Dedicated SQL Pool running 24/7 is the single most expensive default mistake in Azure Synapse</li>
<li class="">How to replace nightly full loads with watermark-based incremental loads in Azure Data Factory</li>
<li class="">How to right-size Spark pools and configure auto-termination to stop paying for idle compute</li>
<li class="">How partition pruning on Delta Lake tables can reduce data scanned by over 90%</li>
<li class="">How ADLS Gen2 lifecycle policies passively save money on storage with zero ongoing effort</li>
<li class="">When a scheduled micro-batch replaces a 24/7 streaming pipeline without any business impact</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-pipeline-architecture">The Pipeline Architecture<a href="https://www.recodehive.com/blog/azure-cost-optimization#the-pipeline-architecture" class="hash-link" aria-label="Direct link to The Pipeline Architecture" title="Direct link to The Pipeline Architecture" translate="no">​</a></h2>
<p>Before the mistakes make sense, here is the full pipeline:</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">Daily sales data from REST API (~2GB/day)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    ↓</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">ADF Pipeline (ingestion)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    ↓</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">ADLS Gen2 — bronze/ (raw Parquet files)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    ↓</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">Spark job (transformation)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    ↓</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">ADLS Gen2 — silver/ (Delta tables)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    ↓</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">Dedicated SQL Pool (serving layer)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    ↓</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">Power BI dashboard (30 users)</span><br></div></code></pre></div></div>
<p>A standard <a href="https://www.recodehive.com/blog/medallion-architecture" target="_blank" rel="noopener noreferrer" class="">Medallion Architecture</a>, nothing exotic. 2GB of data per day. 30 users. Should have cost a few hundred dollars a month at most. It cost <strong>~$4,247.83</strong>. Here is exactly why.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="mistake-1-dedicated-sql-pool-running-247">Mistake #1: Dedicated SQL Pool Running 24/7<a href="https://www.recodehive.com/blog/azure-cost-optimization#mistake-1-dedicated-sql-pool-running-247" class="hash-link" aria-label="Direct link to Mistake #1: Dedicated SQL Pool Running 24/7" title="Direct link to Mistake #1: Dedicated SQL Pool Running 24/7" translate="no">​</a></h2>
<p><strong>Monthly cost of this mistake: ~$1,800</strong></p>
<p>This was the single largest line item. A Dedicated SQL Pool at DW200c was provisioned to serve the Power BI dashboard and left running continuously 24 hours a day, 7 days a week because auto-pause had never been configured.</p>
<p>The thing that surprised me most when I first dug into the bill was this: the SQL Pool charged us the same rate at 3am on a Saturday as it did during peak usage on a Tuesday afternoon. I had assumed, naively, that there was some kind of idle detection built in. There isn't. When it's provisioned, you're paying - full stop, whether a single query runs or not. Our 30 users were active between 9am and 6pm on weekdays, 45 hours of actual usage per week. The pool was running for 168 hours per week. That's 123 hours of idle, fully-billed compute every single week, and it showed up on our invoice as a flat $1,800 charge with no breakdown by usage.</p>
<div class="theme-admonition theme-admonition-note admonition_xJq3 alert alert--secondary"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M6.3 5.69a.942.942 0 0 1-.28-.7c0-.28.09-.52.28-.7.19-.18.42-.28.7-.28.28 0 .52.09.7.28.18.19.28.42.28.7 0 .28-.09.52-.28.7a1 1 0 0 1-.7.3c-.28 0-.52-.11-.7-.3zM8 7.99c-.02-.25-.11-.48-.31-.69-.2-.19-.42-.3-.69-.31H6c-.27.02-.48.13-.69.31-.2.2-.3.44-.31.69h1v3c.02.27.11.5.31.69.2.2.42.31.69.31h1c.27 0 .48-.11.69-.31.2-.19.3-.42.31-.69H8V7.98v.01zM7 2.3c-3.14 0-5.7 2.54-5.7 5.68 0 3.14 2.56 5.7 5.7 5.7s5.7-2.55 5.7-5.7c0-3.15-2.56-5.69-5.7-5.69v.01zM7 .98c3.86 0 7 3.14 7 7s-3.14 7-7 7-7-3.12-7-7 3.14-7 7-7z"></path></svg></span>note</div><div class="admonitionContent_BuS1"><p>The SQL Pool doesn't throttle billing when it's idle, provisioned capacity is billed by the hour regardless of query activity. Pausing the pool is the only way to stop the DWU clock. Storage costs continue when paused, but the compute component which is the large part, stops completely.</p></div></div>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-fix-auto-pause-with-azure-automation-runbooks">The Fix: Auto-Pause with Azure Automation Runbooks<a href="https://www.recodehive.com/blog/azure-cost-optimization#the-fix-auto-pause-with-azure-automation-runbooks" class="hash-link" aria-label="Direct link to The Fix: Auto-Pause with Azure Automation Runbooks" title="Direct link to The Fix: Auto-Pause with Azure Automation Runbooks" translate="no">​</a></h3>
<p>The solution is two Azure Automation runbooks, one to pause the pool at the end of business hours, one to resume it in the morning. The runbooks use Managed Identity for authentication, which avoids hardcoding credentials.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockTitle_OeMC">pause-sql-pool.py</div><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token comment" style="color:#999988;font-style:italic"># Azure Automation Runbook — pause Synapse SQL Pool outside business hours</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> azure</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">identity </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> ManagedIdentityCredential</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> azure</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">mgmt</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">synapse </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> SynapseManagementClient</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">credential </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> ManagedIdentityCredential</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">client </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> SynapseManagementClient</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">credential</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> subscription_id</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># Pause at 7pm weekdays</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">client</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">sql_pools</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">begin_pause</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    resource_group_name</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    workspace_name</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    sql_pool_name</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
<p>Schedule the pause runbook at 7pm weekdays and the resume runbook at 8am. Weekends stay paused unless a manual override is triggered through the Azure portal.</p>
<p><strong>Result:</strong> Billed hours dropped from 720 to roughly 210 per month. SQL Pool cost fell from ~$1,800/month to ~$530/month, a saving of <strong>$1,270/month</strong>.</p>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>tip</div><div class="admonitionContent_BuS1"><p>If your workload is exploratory rather than dashboard-serving, consider whether <a href="https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/on-demand-workspace-overview" target="_blank" rel="noopener noreferrer" class="">Serverless SQL Pool</a> is sufficient. Serverless pools bill per TB of data scanned rather than provisioned DWUs, which can be significantly cheaper for infrequent query patterns.</p></div></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="mistake-2-full-load-running-every-night-instead-of-incremental">Mistake #2: Full Load Running Every Night Instead of Incremental<a href="https://www.recodehive.com/blog/azure-cost-optimization#mistake-2-full-load-running-every-night-instead-of-incremental" class="hash-link" aria-label="Direct link to Mistake #2: Full Load Running Every Night Instead of Incremental" title="Direct link to Mistake #2: Full Load Running Every Night Instead of Incremental" translate="no">​</a></h2>
<p><strong>Monthly cost of this mistake: ~$620</strong></p>
<p>The ADF pipeline was configured to pull <strong>all records</strong> from the source database on every nightly run not just new or updated ones. Day 1 it pulled 2GB. By Day 60 it was pulling 120GB, processing records that had already been processed 59 times before.</p>
<p>This pattern is extremely common and extremely expensive. Every night, the ADF pipeline read the entire historical dataset, the Spark transformation job processed all of it, and the results were written back to Delta. The billable compute scaled with the dataset size, not with the actual volume of new data.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-fix-watermark-based-incremental-loading">The Fix: Watermark-Based Incremental Loading<a href="https://www.recodehive.com/blog/azure-cost-optimization#the-fix-watermark-based-incremental-loading" class="hash-link" aria-label="Direct link to The Fix: Watermark-Based Incremental Loading" title="Direct link to The Fix: Watermark-Based Incremental Loading" translate="no">​</a></h3>
<p>A watermark stores the timestamp of the last successfully processed record. Every pipeline run reads only records newer than that timestamp, then updates the watermark on success.</p>
<p>The implementation in ADF uses two Lookup activities and a query parameterized by the watermark value:</p>
<div class="theme-tabs-container tabs-container tabList__CuJ"><ul role="tablist" aria-orientation="horizontal" class="tabs"><li role="tab" tabindex="0" aria-selected="true" class="tabs__item tabItem_LNqP tabs__item--active">Query</li><li role="tab" tabindex="-1" aria-selected="false" class="tabs__item tabItem_LNqP">Output</li></ul><div class="margin-top--md"><div role="tabpanel" class="tabItem_Ymn6"><div class="language-sql codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockTitle_OeMC">get-watermark.sql</div><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-sql codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token comment" style="color:#999988;font-style:italic">-- Step 1: ADF Lookup activity — retrieve the last watermark</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">SELECT</span><span class="token plain"> last_processed_date</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">FROM</span><span class="token plain"> pipeline_watermarks</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">WHERE</span><span class="token plain"> pipeline_name </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'sales_ingestion'</span><br></div></code></pre></div></div><div class="language-sql codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockTitle_OeMC">incremental-source-query.sql</div><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-sql codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token comment" style="color:#999988;font-style:italic">-- Step 2: Source query, parameterized by watermark from Step 1</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">SELECT</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">*</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">FROM</span><span class="token plain"> orders</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">WHERE</span><span class="token plain"> updated_at </span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'@{activity("GetWatermark").output.firstRow.last_processed_date}'</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token operator" style="color:#393A34">AND</span><span class="token plain"> updated_at </span><span class="token operator" style="color:#393A34">&lt;=</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'@{utcnow()}'</span><br></div></code></pre></div></div><div class="language-sql codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockTitle_OeMC">update-watermark.sql</div><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-sql codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token comment" style="color:#999988;font-style:italic">-- Step 3: After successful pipeline run, advance the watermark</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">UPDATE</span><span class="token plain"> pipeline_watermarks</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">SET</span><span class="token plain"> last_processed_date </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'@{utcnow()}'</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">WHERE</span><span class="token plain"> pipeline_name </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'sales_ingestion'</span><br></div></code></pre></div></div></div><div role="tabpanel" class="tabItem_Ymn6" hidden=""><table><thead><tr><th>Pipeline Run</th><th>Records Processed</th><th>Data Volume</th><th>ADF Cost</th></tr></thead><tbody><tr><td>Before (full load, Day 60)</td><td>~4.8M rows</td><td>~120 GB</td><td>~$22/run</td></tr><tr><td>After (incremental)</td><td>~8,000 rows</td><td>~2 GB</td><td>~$0.40/run</td></tr></tbody></table></div></div></div>
<p><strong>Result:</strong> ADF activity runtime dropped by 94%. Spark compute for the transformation step fell proportionally. Combined saving: <strong>~$585/month</strong>.</p>
<p><img decoding="async" loading="lazy" alt="Azure Data Factory pipeline canvas showing the Lookup (GetWatermark) → Copy Data → Stored Procedure watermark pattern" src="https://www.recodehive.com/assets/images/03-adf-watermark-pipeline-canvas-9a7edbe88298ce2ba2439d437d5e2e99.png" width="1168" height="421" class="img_ev3q"></p>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>tip</div><div class="admonitionContent_BuS1"><p>The watermark pattern requires a reliable <code>updated_at</code> or <code>created_at</code> column in the source table. If your source does not have one, work with the source team to add it, the cost saving on the pipeline side will far outweigh the schema migration effort.</p></div></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="mistake-3-spark-cluster-over-provisioned-for-the-actual-workload">Mistake #3: Spark Cluster Over-Provisioned for the Actual Workload<a href="https://www.recodehive.com/blog/azure-cost-optimization#mistake-3-spark-cluster-over-provisioned-for-the-actual-workload" class="hash-link" aria-label="Direct link to Mistake #3: Spark Cluster Over-Provisioned for the Actual Workload" title="Direct link to Mistake #3: Spark Cluster Over-Provisioned for the Actual Workload" translate="no">​</a></h2>
<p><strong>Monthly cost of this mistake: ~$480</strong></p>
<p>When setting up the Spark pool in Azure Synapse, the default node size <strong>DS3_v2 (4 cores, 14GB RAM)</strong> was selected with 5 nodes. The actual workload: transforming 2–5GB of Parquet files daily with deduplication, type casting, and a few joins.</p>
<p>Two problems compounded each other. First, the cluster was consuming roughly 10x the compute it actually needed for the data volume. Second, auto-termination was set to 60 minutes, meaning after a 12-minute job, the cluster sat idle and fully billed for another 48 minutes before shutting down.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-fix-right-sizing-autoscale-and-fast-termination">The Fix: Right-Sizing, Autoscale, and Fast Termination<a href="https://www.recodehive.com/blog/azure-cost-optimization#the-fix-right-sizing-autoscale-and-fast-termination" class="hash-link" aria-label="Direct link to The Fix: Right-Sizing, Autoscale, and Fast Termination" title="Direct link to The Fix: Right-Sizing, Autoscale, and Fast Termination" translate="no">​</a></h3>
<p>The fix has three components that work together:</p>
<div class="language-json codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockTitle_OeMC">synapse-spark-pool-config.json</div><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-json codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"nodeSize"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"Small"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"minNodeCount"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">2</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"maxNodeCount"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">4</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"autoscaleEnabled"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">true</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"autoTerminationEnabled"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">true</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"autoTerminationDelayInMinutes"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">5</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">}</span><br></div></code></pre></div></div>
<p>The third component is tuning the shuffle partition count inside the Spark notebook. The default of 200 partitions is calibrated for large clusters and large datasets. For 2–5GB of data on a small cluster, 200 partitions creates unnecessary overhead that extends job runtime and therefore billed compute time.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockTitle_OeMC">spark-notebook-config.py</div><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token comment" style="color:#999988;font-style:italic"># Place in the first cell of every Spark notebook</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># Default is 200 partitions — designed for multi-TB workloads</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># For 2–5 GB datasets, 8 is appropriate</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">spark</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">conf</span><span class="token punctuation" style="color:#393A34">.</span><span class="token builtin">set</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"spark.sql.shuffle.partitions"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"8"</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
<div class="theme-admonition theme-admonition-info admonition_xJq3 alert alert--info"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M7 2.3c3.14 0 5.7 2.56 5.7 5.7s-2.56 5.7-5.7 5.7A5.71 5.71 0 0 1 1.3 8c0-3.14 2.56-5.7 5.7-5.7zM7 1C3.14 1 0 4.14 0 8s3.14 7 7 7 7-3.14 7-7-3.14-7-7-7zm1 3H6v5h2V4zm0 6H6v2h2v-2z"></path></svg></span>info</div><div class="admonitionContent_BuS1"><p>A good rule of thumb for <code>spark.sql.shuffle.partitions</code>: aim for roughly <strong>128MB of data per partition</strong>. For a 2GB dataset, that's approximately 16 partitions. Err slightly lower rather than higher for small datasets on small clusters.</p></div></div>
<p><strong>Result:</strong> Spark compute cost dropped from ~$580/month to ~$100/month, a saving of <strong>$480/month</strong>.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="mistake-4-reading-adls-gen2-files-without-partition-pruning">Mistake #4: Reading ADLS Gen2 Files Without Partition Pruning<a href="https://www.recodehive.com/blog/azure-cost-optimization#mistake-4-reading-adls-gen2-files-without-partition-pruning" class="hash-link" aria-label="Direct link to Mistake #4: Reading ADLS Gen2 Files Without Partition Pruning" title="Direct link to Mistake #4: Reading ADLS Gen2 Files Without Partition Pruning" translate="no">​</a></h2>
<p><strong>Monthly cost of this mistake: ~$290</strong></p>
<p>The Silver layer Delta table was partitioned by <code>order_date</code>. The Spark transformation job, however, was reading the entire table and applying a date filter <em>after</em> the read not during it.</p>
<p>Think of it like this: imagine your filing cabinet is organized by month, one drawer per month, clearly labelled. Partition pruning is pulling open only January's drawer. What we were doing instead was dumping every drawer onto the floor, sifting through three years of paper, and throwing away everything that wasn't from January then tidying it all back up. Every. Single. Night. The cabinet is organized correctly. We just weren't using the labels.</p>
<p>With 90 days of accumulated history, this approach was scanning 90x more data than necessary on every run. The fix is to push the filter into the read itself so Spark can use the partition directory structure to skip everything irrelevant before a single file is opened.</p>
<div class="theme-tabs-container tabs-container tabList__CuJ"><ul role="tablist" aria-orientation="horizontal" class="tabs"><li role="tab" tabindex="0" aria-selected="true" class="tabs__item tabItem_LNqP tabs__item--active">Before (Expensive)</li><li role="tab" tabindex="-1" aria-selected="false" class="tabs__item tabItem_LNqP">After (Optimized)</li></ul><div class="margin-top--md"><div role="tabpanel" class="tabItem_Ymn6"><div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockTitle_OeMC">transformation-before.py</div><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token comment" style="color:#999988;font-style:italic"># Reads the ENTIRE Delta table, then filters in memory</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># With 90 days of history: scans ~180GB to get ~2GB of useful data</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">silver_df </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> spark</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">read</span><span class="token punctuation" style="color:#393A34">.</span><span class="token builtin">format</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"delta"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">load</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">"abfss://data@mylake.dfs.core.windows.net/silver/sales/"</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">filtered_df </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> silver_df</span><span class="token punctuation" style="color:#393A34">.</span><span class="token builtin">filter</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">col</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"order_date"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">==</span><span class="token plain"> yesterday</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div></div><div role="tabpanel" class="tabItem_Ymn6" hidden=""><div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockTitle_OeMC">transformation-after.py</div><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> datetime </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> datetime</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> timedelta</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">yesterday </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">datetime</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">now</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token plain"> timedelta</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">days</span><span class="token operator" style="color:#393A34">=</span><span class="token number" style="color:#36acaa">1</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">strftime</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"%Y-%m-%d"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># Filter pushed into the read — Spark only opens yesterday's partition</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># With 90 days of history: scans ~2GB instead of ~180GB</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">silver_df </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    spark</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">read</span><span class="token punctuation" style="color:#393A34">.</span><span class="token builtin">format</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"delta"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">load</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"abfss://data@mylake.dfs.core.windows.net/silver/sales/"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">.</span><span class="token builtin">filter</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">col</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"order_date"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">==</span><span class="token plain"> yesterday</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div></div></div></div>
<p><strong>Result:</strong> Data scanned per run dropped from ~180GB to ~2GB. Spark runtime fell from 18 minutes to 4 minutes. Monthly saving: <strong>~$260/month</strong>.</p>
<div class="theme-admonition theme-admonition-note admonition_xJq3 alert alert--secondary"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M6.3 5.69a.942.942 0 0 1-.28-.7c0-.28.09-.52.28-.7.19-.18.42-.28.7-.28.28 0 .52.09.7.28.18.19.28.42.28.7 0 .28-.09.52-.28.7a1 1 0 0 1-.7.3c-.28 0-.52-.11-.7-.3zM8 7.99c-.02-.25-.11-.48-.31-.69-.2-.19-.42-.3-.69-.31H6c-.27.02-.48.13-.69.31-.2.2-.3.44-.31.69h1v3c.02.27.11.5.31.69.2.2.42.31.69.31h1c.27 0 .48-.11.69-.31.2-.19.3-.42.31-.69H8V7.98v.01zM7 2.3c-3.14 0-5.7 2.54-5.7 5.68 0 3.14 2.56 5.7 5.7 5.7s5.7-2.55 5.7-5.7c0-3.15-2.56-5.69-5.7-5.69v.01zM7 .98c3.86 0 7 3.14 7 7s-3.14 7-7 7-7-3.12-7-7 3.14-7 7-7z"></path></svg></span>note</div><div class="admonitionContent_BuS1"><p>For partition pruning to work, two conditions must both be true. The table must be partitioned by the filter column, and the filter must be applied at read time not in a subsequent transformation step. Applying the filter even one <code>.filter()</code> call after the <code>.load()</code> still results in a full table scan in some execution contexts.</p></div></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="mistake-5-keeping-historical-data-on-hot-storage-tier">Mistake #5: Keeping Historical Data on Hot Storage Tier<a href="https://www.recodehive.com/blog/azure-cost-optimization#mistake-5-keeping-historical-data-on-hot-storage-tier" class="hash-link" aria-label="Direct link to Mistake #5: Keeping Historical Data on Hot Storage Tier" title="Direct link to Mistake #5: Keeping Historical Data on Hot Storage Tier" translate="no">​</a></h2>
<p><strong>Monthly cost of this mistake: ~$180</strong></p>
<p>The ADLS Gen2 Bronze layer had 14 months of raw Parquet files sitting on the <strong>Hot</strong> storage tier. No lifecycle policy had ever been configured.</p>
<p>ADLS Gen2 charges different rates depending on the storage access tier. When I pulled our actual invoice line items, the numbers told a clear story: our Bronze container was costing us $0.023 per GB per month on Hot, while data we hadn't touched in months was sitting right next to yesterday's files paying the same rate. Moving files older than 30 days to Cool dropped that rate to roughly $0.013/GB, about 44% less for data we only needed occasionally. Files older than 180 days dropped to Archive at around $0.002/GB, which is where old Bronze raw files belong when the Silver layer already has the clean version.</p>
<p>Fourteen months of ~2GB/day accumulates to roughly 850GB in the Bronze layer. The fix required exactly one policy configuration.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-fix-adls-gen2-lifecycle-management-policy">The Fix: ADLS Gen2 Lifecycle Management Policy<a href="https://www.recodehive.com/blog/azure-cost-optimization#the-fix-adls-gen2-lifecycle-management-policy" class="hash-link" aria-label="Direct link to The Fix: ADLS Gen2 Lifecycle Management Policy" title="Direct link to The Fix: ADLS Gen2 Lifecycle Management Policy" translate="no">​</a></h3>
<div class="language-json codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockTitle_OeMC">lifecycle-policy.json</div><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-json codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"rules"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token property" style="color:#36acaa">"name"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"bronze-tier-management"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token property" style="color:#36acaa">"type"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"Lifecycle"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token property" style="color:#36acaa">"definition"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token property" style="color:#36acaa">"filters"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token property" style="color:#36acaa">"prefixMatch"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">"bronze/"</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token property" style="color:#36acaa">"blobTypes"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">"blockBlob"</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token property" style="color:#36acaa">"actions"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token property" style="color:#36acaa">"baseBlob"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token property" style="color:#36acaa">"tierToCool"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">              </span><span class="token property" style="color:#36acaa">"daysAfterModificationGreaterThan"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">30</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token property" style="color:#36acaa">"tierToArchive"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">              </span><span class="token property" style="color:#36acaa">"daysAfterModificationGreaterThan"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">180</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">}</span><br></div></code></pre></div></div>
<p>Bronze files automatically move to Cool after 30 days and Archive after 180 days — with no pipeline changes and no ongoing maintenance.</p>
<p><img decoding="async" loading="lazy" alt="Azure portal lifecycle management rule editor showing Base blobs configured to move to Cool after 30 days and Archive after 180 days" src="https://www.recodehive.com/assets/images/05-adls-lifecycle-details-45113212aa60a0e0c92b62d20341becf.png" width="824" height="324" class="img_ev3q"></p>
<div class="theme-admonition theme-admonition-tip admonition_xJq3 alert alert--success"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 12 16"><path fill-rule="evenodd" d="M6.5 0C3.48 0 1 2.19 1 5c0 .92.55 2.25 1 3 1.34 2.25 1.78 2.78 2 4v1h5v-1c.22-1.22.66-1.75 2-4 .45-.75 1-2.08 1-3 0-2.81-2.48-5-5.5-5zm3.64 7.48c-.25.44-.47.8-.67 1.11-.86 1.41-1.25 2.06-1.45 3.23-.02.05-.02.11-.02.17H5c0-.06 0-.13-.02-.17-.2-1.17-.59-1.83-1.45-3.23-.2-.31-.42-.67-.67-1.11C2.44 6.78 2 5.65 2 5c0-2.2 2.02-4 4.5-4 1.22 0 2.36.42 3.22 1.19C10.55 2.94 11 3.94 11 5c0 .66-.44 1.78-.86 2.48zM4 14h5c-.23 1.14-1.3 2-2.5 2s-2.27-.86-2.5-2z"></path></svg></span>tip</div><div class="admonitionContent_BuS1"><p>Apply lifecycle policies to the Silver and Gold layers too, with longer thresholds. Silver data accessed for backfills after 90 days can move to Cool. Gold data older than 365 days can move to Archive if your reporting doesn't require historical drill-downs that old.</p></div></div>
<p><strong>Result:</strong> ~$160/month in combined storage and egress savings, purely passive, set once and forgotten.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="mistake-6-a-streaming-pipeline-for-15-minute-update-requirements">Mistake #6: A Streaming Pipeline for 15-Minute Update Requirements<a href="https://www.recodehive.com/blog/azure-cost-optimization#mistake-6-a-streaming-pipeline-for-15-minute-update-requirements" class="hash-link" aria-label="Direct link to Mistake #6: A Streaming Pipeline for 15-Minute Update Requirements" title="Direct link to Mistake #6: A Streaming Pipeline for 15-Minute Update Requirements" translate="no">​</a></h2>
<p><strong>Monthly cost of this mistake: ~$380</strong></p>
<p>A secondary pipeline fed a "near real-time" inventory dashboard. The product team had asked for updates <em>as fast as possible</em>, which was interpreted as: build a Kafka + Flink streaming pipeline with always-on infrastructure.</p>
<p>What the product team actually needed, when pinned down to a specific number: inventory counts updated <strong>every 15 minutes</strong>.</p>
<p>A streaming pipeline running 24/7 to deliver 15-minute updates is the cloud equivalent of leaving your car engine running all night because you have an early meeting. The always-on Kafka cluster and Flink job cost $380/month. The business requirement was achievable with a job that runs for 2–3 minutes, 96 times a day.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-fix-micro-batch-with-adf-tumbling-window-trigger">The Fix: Micro-Batch with ADF Tumbling Window Trigger<a href="https://www.recodehive.com/blog/azure-cost-optimization#the-fix-micro-batch-with-adf-tumbling-window-trigger" class="hash-link" aria-label="Direct link to The Fix: Micro-Batch with ADF Tumbling Window Trigger" title="Direct link to The Fix: Micro-Batch with ADF Tumbling Window Trigger" translate="no">​</a></h3>
<p>An ADF Tumbling Window trigger fires the pipeline every 15 minutes. Each run reads only the delta since the last watermark, processes it, and shuts down. No infrastructure stays running between executions.</p>
<div class="language-json codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockTitle_OeMC">tumbling-window-trigger.json</div><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-json codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"type"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"TumblingWindowTrigger"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"typeProperties"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"frequency"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"Minute"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"interval"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">15</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"startTime"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"2024-01-01T00:00:00Z"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"retryPolicy"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token property" style="color:#36acaa">"count"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">3</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token property" style="color:#36acaa">"intervalInSeconds"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">30</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">}</span><br></div></code></pre></div></div>
<p>The pipeline runs for 2–3 minutes every 15 minutes, processes the delta since the last run using the same watermark pattern from Mistake #2, then shuts down. The product team's dashboard still updates every 15 minutes. They noticed zero difference.</p>
<div class="theme-admonition theme-admonition-info admonition_xJq3 alert alert--info"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M7 2.3c3.14 0 5.7 2.56 5.7 5.7s-2.56 5.7-5.7 5.7A5.71 5.71 0 0 1 1.3 8c0-3.14 2.56-5.7 5.7-5.7zM7 1C3.14 1 0 4.14 0 8s3.14 7 7 7 7-3.14 7-7-3.14-7-7-7zm1 3H6v5h2V4zm0 6H6v2h2v-2z"></path></svg></span>info</div><div class="admonitionContent_BuS1"><p>A useful mental model for this decision: <strong>streaming is the right choice when latency requirements are below 60 seconds</strong>. For anything above that threshold, a well-designed micro-batch pipeline is almost always cheaper, simpler, easier to monitor, and easier to debug. The <a href="https://www.recodehive.com/blog/batch-vs-stream-processing" target="_blank" rel="noopener noreferrer" class="">hidden costs of streaming pipelines</a> go beyond compute they include more complex failure handling, harder-to-test logic, and longer debugging cycles.</p></div></div>
<p><strong>Result:</strong> Streaming infrastructure cost of $380/month replaced by ~$40/month in ADF + Spark compute. <strong>$340/month saved.</strong></p>
<p><img decoding="async" loading="lazy" alt="Azure Data Factory Tumbling Window trigger configuration showing 15-minute interval and retry policy" src="https://www.recodehive.com/assets/images/07-adf-tumbling-window-trigger-14e15552bd4bdae31164864001c621c4.png" width="797" height="884" class="img_ev3q"></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="before-and-after-summary">Before and After Summary<a href="https://www.recodehive.com/blog/azure-cost-optimization#before-and-after-summary" class="hash-link" aria-label="Direct link to Before and After Summary" title="Direct link to Before and After Summary" translate="no">​</a></h2>
<div class="theme-admonition theme-admonition-info admonition_xJq3 alert alert--info"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 14 16"><path fill-rule="evenodd" d="M7 2.3c3.14 0 5.7 2.56 5.7 5.7s-2.56 5.7-5.7 5.7A5.71 5.71 0 0 1 1.3 8c0-3.14 2.56-5.7 5.7-5.7zM7 1C3.14 1 0 4.14 0 8s3.14 7 7 7 7-3.14 7-7-3.14-7-7-7zm1 3H6v5h2V4zm0 6H6v2h2v-2z"></path></svg></span>info</div><div class="admonitionContent_BuS1"><table><thead><tr><th>Mistake</th><th>Monthly Cost Before</th><th>Monthly Cost After</th><th>Saving</th></tr></thead><tbody><tr><td>Dedicated SQL Pool running 24/7</td><td>$1,800</td><td>$530</td><td><strong>$1,270</strong></td></tr><tr><td>Full load instead of incremental</td><td>$620</td><td>$35</td><td><strong>$585</strong></td></tr><tr><td>Over-provisioned Spark cluster</td><td>$580</td><td>$100</td><td><strong>$480</strong></td></tr><tr><td>No partition pruning</td><td>$290</td><td>$30</td><td><strong>$260</strong></td></tr><tr><td>Hot storage for historical data</td><td>$180</td><td>$20</td><td><strong>$160</strong></td></tr><tr><td>Streaming for 15-min updates</td><td>$380</td><td>$40</td><td><strong>$340</strong></td></tr><tr><td><strong>Total</strong></td><td><strong>$3,850</strong></td><td><strong>$755</strong></td><td><strong>$3,095</strong></td></tr></tbody></table></div></div>
<p>From $4,247 down to approximately $1,150 after all fixes, a <strong>73% cost reduction</strong> on a pipeline doing exactly the same work on exactly the same data.</p>
<p><img decoding="async" loading="lazy" alt="Before and after bar chart showing Azure monthly cost by category, with before total of $4,247 and after total of $1,150" src="https://www.recodehive.com/assets/images/01-azure-cost-before-after-08989cbf97868d2fddd02c076af54520.png" width="2025" height="1272" class="img_ev3q"></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="cost-optimization-checklist">Cost Optimization Checklist<a href="https://www.recodehive.com/blog/azure-cost-optimization#cost-optimization-checklist" class="hash-link" aria-label="Direct link to Cost Optimization Checklist" title="Direct link to Cost Optimization Checklist" translate="no">​</a></h2>
<p>Run through this every quarter. Each item is a question, if the answer is "no" or "I don't know," investigate it.</p>
<p><strong>Dedicated SQL Pool</strong></p>
<ul class="contains-task-list containsTaskList_mC6p">
<li class="task-list-item"><input type="checkbox" disabled=""> <!-- -->Is auto-pause configured for outside business hours?</li>
<li class="task-list-item"><input type="checkbox" disabled=""> <!-- -->Is Dedicated SQL Pool actually required, or would Serverless SQL Pool suffice for the query pattern?</li>
</ul>
<p><strong>ADF Pipelines</strong></p>
<ul class="contains-task-list containsTaskList_mC6p">
<li class="task-list-item"><input type="checkbox" disabled=""> <!-- -->Are any pipelines running full loads where incremental loads are possible?</li>
<li class="task-list-item"><input type="checkbox" disabled=""> <!-- -->Is a watermark implemented for every pipeline reading time-series data?</li>
</ul>
<p><strong>Spark Pools</strong></p>
<ul class="contains-task-list containsTaskList_mC6p">
<li class="task-list-item"><input type="checkbox" disabled=""> <!-- -->Is node size right-sized for actual data volume, not default?</li>
<li class="task-list-item"><input type="checkbox" disabled=""> <!-- -->Is auto-termination set to 5 minutes, not 30–60?</li>
<li class="task-list-item"><input type="checkbox" disabled=""> <!-- -->Is <code>spark.sql.shuffle.partitions</code> tuned to actual data size?</li>
<li class="task-list-item"><input type="checkbox" disabled=""> <!-- -->Is autoscale enabled with realistic min/max node counts?</li>
</ul>
<p><strong>ADLS Gen2</strong></p>
<ul class="contains-task-list containsTaskList_mC6p">
<li class="task-list-item"><input type="checkbox" disabled=""> <!-- -->Are lifecycle policies configured on all containers?</li>
<li class="task-list-item"><input type="checkbox" disabled=""> <!-- -->Are Delta tables partitioned by the column filtered most frequently?</li>
<li class="task-list-item"><input type="checkbox" disabled=""> <!-- -->Is partition pruning applied at read time in all Spark notebooks?</li>
</ul>
<p><strong>Streaming Infrastructure</strong></p>
<ul class="contains-task-list containsTaskList_mC6p">
<li class="task-list-item"><input type="checkbox" disabled=""> <!-- -->What is the actual latency requirement, in minutes?</li>
<li class="task-list-item"><input type="checkbox" disabled=""> <!-- -->If it is above 5 minutes — is a micro-batch pipeline in use instead of always-on streaming?</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="key-lessons">Key Lessons<a href="https://www.recodehive.com/blog/azure-cost-optimization#key-lessons" class="hash-link" aria-label="Direct link to Key Lessons" title="Direct link to Key Lessons" translate="no">​</a></h2>
<p><strong>Defaults are expensive by design.</strong> Azure's defaults - 60-minute Spark termination, no SQL Pool auto-pause, no lifecycle policies are chosen for zero-friction setup, not cost efficiency. Every default should be reviewed and overridden deliberately on day one, not after the first billing surprise.</p>
<p><strong>Incremental loading is not a future optimization.</strong> For any pipeline reading time-series data from a growing source, a full load that runs daily compounds in cost the same way interest compounds on debt. The watermark pattern takes a few hours to implement and pays for itself within a week.</p>
<p><strong>Partition pruning is free performance.</strong> Setting up partitioning correctly at table creation and pushing filters into the read step costs nothing to implement and can reduce Spark compute by over 90%. The only requirement is knowing which column you filter on most frequently which you almost certainly already know.</p>
<p><strong>"Real-time" almost never means real-time.</strong> The product team's requirement was 15-minute updates. The engineering interpretation was 24/7 streaming infrastructure. The gap between those two decisions cost $340/month and made the pipeline significantly harder to maintain. Before designing streaming, ask for a specific latency number, then design to that number.</p>
<p><strong>Azure Cost Analysis is a weekly habit, not a monthly emergency.</strong> The six mistakes above were invisible until the invoice arrived. Fifteen minutes a week in <a href="https://learn.microsoft.com/en-us/azure/cost-management-billing/" target="_blank" rel="noopener noreferrer" class="">Azure Cost Management</a> catches problems while they are a $50 anomaly, not a $500 line item.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="frequently-asked-questions">Frequently Asked Questions<a href="https://www.recodehive.com/blog/azure-cost-optimization#frequently-asked-questions" class="hash-link" aria-label="Direct link to Frequently Asked Questions" title="Direct link to Frequently Asked Questions" translate="no">​</a></h2>
<p><strong>Q: Should I always use Serverless SQL Pool instead of Dedicated SQL Pool to save costs?</strong>
A: We actually tested this switch on our own setup before committing to the auto-pause approach. Serverless made sense for us until we crossed about 6 hours of daily query time, below that threshold, serverless was cheaper every single month without exception, and we didn't need to manage any pause/resume scheduling at all. If your Power BI dashboard is hit heavily throughout the business day, dedicated will eventually win on pure cost. But if usage is bursty or confined to a few hours, don't provision dedicated capacity and then fight to keep it paused, just go serverless from the start.</p>
<p><strong>Q: What if my source system doesn't have a reliable <code>updated_at</code> column for watermarking?</strong>
A: We ran into this with one of our source databases, no timestamp, no audit column, nothing. We ended up going the CDC route using Debezium, which captures row-level changes at the database log level without touching the source schema at all. It took about a day to set up and was the cleanest solution we found. For append-only tables, an auto-incrementing primary key works as a watermark substitute. If neither option exists, row hash comparison is a last resort, it detects changes, but you're still reading the full source every run, which defeats most of the point.</p>
<p><strong>Q: How do I choose the right partition column for a Delta Lake table?</strong>
A: The short answer is: whichever column appears in your most common filter is your partition column, and for time-series data that's almost always a date. What I'd caution against is partitioning on something high-cardinality like user ID or transaction ID, we made that mistake early on and ended up with thousands of tiny files that made partition pruning useless and actually slowed down reads. A healthy partition should hold somewhere between 100MB and 1GB of data. If a partition is smaller than that, you're creating file overhead without the pruning benefit.</p>
<p><strong>Q: Will moving data to Archive tier in ADLS Gen2 break my backfill pipelines?</strong>
A: I learned this the hard way, a backfill job failed at 2am because Archive data doesn't just open like a normal file. You have to explicitly rehydrate it first, and depending on the priority tier you choose, that can take anywhere from an hour to fifteen hours. We got caught by this once, and now the rule on our team is: only archive data where we have at least 24 hours of lead time if a backfill request comes in. For Bronze raw files older than 180 days, that's usually fine, nobody's doing emergency backfills on six-month-old raw data. For Silver and Gold, we stop at Cool tier and don't go to Archive at all, because the rehydration wait is unacceptable mid-incident.</p>
<p><strong>Q: Is a 5-minute Spark auto-termination window too aggressive for interactive notebooks?</strong>
A: For scheduled production jobs, 5 minutes is ideal, the job finishes, and within 5 minutes the cluster is gone and billing stops. For interactive development work, 5 minutes will drive you crazy because the cluster spins down between notebook cells if you pause to think. What we do is maintain two separate Spark pool configurations: one for production jobs set to 5-minute termination, and one for dev work set to 60 minutes. They run on different node sizes too. Keep them separate and you get cost efficiency in production without interrupting development flow.</p>
<p><strong>Q: How do I detect whether partition pruning is actually working in my Spark job?</strong>
A: The fastest way is to run <code>df.explain()</code> and look at the physical plan output. If partition pruning is active, you'll see a <code>PartitionFilters</code> entry in the scan node listing your filter predicate. If that field shows <code>PartitionFilters: []</code> - empty brackets, you're scanning the full table regardless of what your code looks like. I've caught this bug three times by checking the plan after what looked like a correctly written filter, and each time it turned out the filter was being applied one transformation step too late, after Spark had already committed to a full read.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="references-and-further-reading">References and Further Reading<a href="https://www.recodehive.com/blog/azure-cost-optimization#references-and-further-reading" class="hash-link" aria-label="Direct link to References and Further Reading" title="Direct link to References and Further Reading" translate="no">​</a></h2>
<ul>
<li class=""><a href="https://learn.microsoft.com/en-us/azure/cost-management-billing/" target="_blank" rel="noopener noreferrer" class="">Microsoft Docs - Azure Cost Management and Billing</a></li>
<li class=""><a href="https://learn.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/pause-and-resume-compute-portal" target="_blank" rel="noopener noreferrer" class="">Microsoft Docs - Synapse SQL Pool Pause and Resume</a></li>
<li class=""><a href="https://learn.microsoft.com/en-us/azure/storage/blobs/lifecycle-management-overview" target="_blank" rel="noopener noreferrer" class="">Microsoft Docs - ADLS Gen2 Lifecycle Management Policies</a></li>
<li class=""><a href="https://docs.delta.io/latest/optimizations-oss.html" target="_blank" rel="noopener noreferrer" class="">Delta Lake - Partition Pruning and Optimization</a></li>
<li class=""><a href="https://learn.microsoft.com/en-us/azure/data-factory/how-to-create-tumbling-window-trigger" target="_blank" rel="noopener noreferrer" class="">Microsoft Docs - ADF Tumbling Window Trigger</a></li>
<li class=""><a href="https://www.recodehive.com/blog/azure-storage-options" target="_blank" rel="noopener noreferrer" class="">RecodeHive - Azure Storage and ADLS Gen2 Complete Guide</a></li>
<li class=""><a href="https://www.recodehive.com/blog/batch-vs-stream-processing" target="_blank" rel="noopener noreferrer" class="">RecodeHive - Hidden Cost of Streaming Pipelines</a></li>
<li class=""><a href="https://www.recodehive.com/blog/medallion-architecture" target="_blank" rel="noopener noreferrer" class="">RecodeHive - Medallion Architecture Explained</a></li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="about-the-author">About the Author<a href="https://www.recodehive.com/blog/azure-cost-optimization#about-the-author" class="hash-link" aria-label="Direct link to About the Author" title="Direct link to About the Author" translate="no">​</a></h2>
<p><strong>Aditya Singh Rathore</strong> is a Data Engineer focused on building modern, scalable data platforms on Azure. He writes about data engineering, cloud architecture, and real-world pipelines on <a href="https://www.recodehive.com/" target="_blank" rel="noopener noreferrer" class="">RecodeHive</a> — turning hard-won production lessons into content anyone can apply.</p>
<p>🔗 <a href="https://www.linkedin.com/in/aditya-singh-rathore0017/" target="_blank" rel="noopener noreferrer" class="">LinkedIn</a> | <a href="https://github.com/Adez017" target="_blank" rel="noopener noreferrer" class="">GitHub</a></p>
<p>📩 Got an Azure bill that surprised you? Drop the line item in the comments — happy to help debug it.</p>
<div></div>]]></content:encoded>
            <author>rathoreadityasingh30@gmail.com (Aditya Singh Rathore)</author>
            <category>azure</category>
            <category>cost-optimization</category>
            <category>data-pipeline</category>
            <category>adls-gen2</category>
            <category>azure-synapse</category>
            <category>apache-spark</category>
            <category>azure-data-factory</category>
            <category>delta-lake</category>
            <category>data-engineering</category>
            <category>cloud-cost</category>
        </item>
        <item>
            <title><![CDATA[Medallion Architecture: How to Stop Your Data Pipeline from Becoming a Nightmare]]></title>
            <link>https://www.recodehive.com/blog/medallion-architecture</link>
            <guid>https://www.recodehive.com/blog/medallion-architecture</guid>
            <pubDate>Thu, 07 May 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Most data pipelines don't fail because of bad technology. They fail because raw data flows directly into reports with no checkpoints, no validation, and no clear ownership. Medallion Architecture fixes exactly this — here's how it works, why it matters, and how to implement it in practice.]]></description>
            <content:encoded><![CDATA[<p>It was a Tuesday afternoon when our analytics lead sent a message that made my stomach drop.</p>
<p><em>"The revenue numbers in the dashboard don't match what finance is reporting. We're off by $180,000. Can you check the pipeline?"</em></p>
<p>I spent the next four hours tracing data through a tangled mess of transformations, none of them documented, some running directly on raw API responses, others written six months ago by someone who had since left the team. By the time I found the issue (a deduplication step that had silently stopped working after a schema change upstream), the damage was done. Three teams had been working off wrong numbers for two weeks.</p>
<p>That incident is what introduced me to <strong>Medallion Architecture</strong>.</p>
<p>Not as a concept from a blog post. As a solution to a real, expensive, embarrassing problem that could have been caught immediately if we'd had any structure in how data moved through our pipeline.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="so-what-is-it">So, What Is It?<a href="https://www.recodehive.com/blog/medallion-architecture#so-what-is-it" class="hash-link" aria-label="Direct link to So, What Is It?" title="Direct link to So, What Is It?" translate="no">​</a></h2>
<p>Think of Medallion Architecture like a water filtration system.</p>
<p>Water from a river (your raw data) goes through multiple stages of filtering before it's safe to drink (your final reports). You wouldn't drink straight from the river — and you shouldn't build reports directly on raw, unvalidated data either.</p>
<p>The architecture divides your data journey into three layers:</p>
<blockquote>
<p><strong>Bronze → Silver → Gold</strong></p>
</blockquote>
<p>Each layer has one job. Each layer makes the data a little more trustworthy. By the time data reaches the end, it's reliable, consistent, and ready to power real business decisions.</p>
<p><img decoding="async" loading="lazy" alt="Three-layer Medallion Architecture flow diagram" src="https://www.recodehive.com/assets/images/medallion-architecture-flow-d57a4fd87013cac64a88a23eebe3dff6.png" width="1672" height="941" class="img_ev3q"></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-bronze-the-keep-everything-layer">🥉 Bronze: The "Keep Everything" Layer<a href="https://www.recodehive.com/blog/medallion-architecture#-bronze-the-keep-everything-layer" class="hash-link" aria-label="Direct link to 🥉 Bronze: The &quot;Keep Everything&quot; Layer" title="Direct link to 🥉 Bronze: The &quot;Keep Everything&quot; Layer" translate="no">​</a></h2>
<p>Bronze is where data arrives, exactly as it came from the source. No cleaning, no filtering, no judgment.</p>
<p>APIs, databases, logs, CSV exports, it all lands here, untouched.</p>
<p>After the revenue incident, the first thing we did was create a Bronze layer in ADLS Gen2, a dedicated folder where every raw API response landed as-is, timestamped, and never overwritten.</p>
<p><strong>Why not clean it immediately?</strong></p>
<p>Because you <em>will</em> make mistakes in your pipeline. And when you do, you need to be able to go back to the original data and start over, without re-calling the API, without re-pulling from a source that may have already changed.</p>
<p>Bronze is your safety net. It's immutable, append-only, and complete.</p>
<blockquote>
<p><strong>Think of it as your data's long-term memory</strong>, messy, raw, but irreplaceable.</p>
</blockquote>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-bronze-looks-like-in-practice">What Bronze looks like in practice<a href="https://www.recodehive.com/blog/medallion-architecture#what-bronze-looks-like-in-practice" class="hash-link" aria-label="Direct link to What Bronze looks like in practice" title="Direct link to What Bronze looks like in practice" translate="no">​</a></h3>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">adls-gen2/</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  └── bronze/</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        └── sales/</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">              └── 2024/</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                    ├── 01/raw_orders_20240115.parquet</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                    ├── 02/raw_orders_20240201.parquet</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                    └── 03/raw_orders_20240305.parquet</span><br></div></code></pre></div></div>
<p>Files land here partitioned by date. Nothing is modified after landing. If the pipeline fails three steps later, you don't re-ingest, you reprocess from Bronze.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="key-rules-for-bronze">Key rules for Bronze<a href="https://www.recodehive.com/blog/medallion-architecture#key-rules-for-bronze" class="hash-link" aria-label="Direct link to Key rules for Bronze" title="Direct link to Key rules for Bronze" translate="no">​</a></h3>
<ul>
<li class=""><strong>Append only</strong>: never overwrite or delete records</li>
<li class=""><strong>No transformation</strong>: store exactly what the source sent, including bad records</li>
<li class=""><strong>Schema as-received</strong>: don't enforce structure here, even if the source changes its format</li>
<li class=""><strong>Partition by ingestion date</strong>: makes reprocessing specific time ranges simple</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-silver-where-the-real-work-happens">🥈 Silver: Where the Real Work Happens<a href="https://www.recodehive.com/blog/medallion-architecture#-silver-where-the-real-work-happens" class="hash-link" aria-label="Direct link to 🥈 Silver: Where the Real Work Happens" title="Direct link to 🥈 Silver: Where the Real Work Happens" translate="no">​</a></h2>
<p>This is where data engineering gets interesting and where most of the actual work lives.</p>
<p>In the Silver layer, you take everything from Bronze and make it usable:</p>
<ul>
<li class=""><strong>Deduplicate</strong> - remove duplicate records from retry logic or overlapping ingestion windows</li>
<li class=""><strong>Standardize</strong> - dates in ISO format, currencies in base units, strings trimmed and consistent</li>
<li class=""><strong>Validate</strong> - flag or quarantine records that fail business rules (negative prices, missing required fields)</li>
<li class=""><strong>Enforce schema</strong> - write Delta tables with defined column types and constraints</li>
<li class=""><strong>Enrich</strong> - join raw records with reference data (product names, region codes, customer tiers)</li>
</ul>
<p>Most of the heavy lifting in a data pipeline lives here. It's not glamorous work but it's what separates trustworthy analytics from chaos.</p>
<blockquote>
<p><strong>Think of it as the editorial desk</strong>, messy raw material goes in, clean, consistent content comes out.</p>
</blockquote>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-silver-looks-like-in-practice">What Silver looks like in practice<a href="https://www.recodehive.com/blog/medallion-architecture#what-silver-looks-like-in-practice" class="hash-link" aria-label="Direct link to What Silver looks like in practice" title="Direct link to What Silver looks like in practice" translate="no">​</a></h3>
<p>Here's a simple PySpark transformation from Bronze to Silver:</p>
<ul>
<li class=""><a href="https://oneuptime.com/blog/post/2026-02-17-how-to-build-a-data-lakehouse-architecture-on-gcp-using-cloud-storage-dataproc-and-bigquery/view" target="_blank" rel="noopener noreferrer" class="">Reference code</a></li>
</ul>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> pyspark</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">sql </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> SparkSession</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> pyspark</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">sql</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">functions </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> col</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> to_date</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> lower</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> trim</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> when</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">spark </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> SparkSession</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">builder</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">appName</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"BronzeToSilver"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">getOrCreate</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># Read from Bronze</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">bronze_df </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> spark</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">read</span><span class="token punctuation" style="color:#393A34">.</span><span class="token builtin">format</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"parquet"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">load</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">"abfss://data@mylake.dfs.core.windows.net/bronze/sales/2024/"</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># Clean and validate</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">silver_df </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    bronze_df</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">dropDuplicates</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">"order_id"</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain">                              </span><span class="token comment" style="color:#999988;font-style:italic"># deduplicate</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">withColumn</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"order_date"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> to_date</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">col</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"order_date"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"yyyy-MM-dd"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">withColumn</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"region"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> lower</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">trim</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">col</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"region"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain">          </span><span class="token comment" style="color:#999988;font-style:italic"># standardize</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">withColumn</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"product"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> lower</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">trim</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">col</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"product"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">withColumn</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token string" style="color:#e3116c">"is_valid"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        when</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">col</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"amount"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">0</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">True</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">otherwise</span><span class="token punctuation" style="color:#393A34">(</span><span class="token boolean" style="color:#36acaa">False</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain">        </span><span class="token comment" style="color:#999988;font-style:italic"># validate</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">.</span><span class="token builtin">filter</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">col</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"order_id"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">isNotNull</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain">                       </span><span class="token comment" style="color:#999988;font-style:italic"># remove nulls</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># Write to Silver as Delta table</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    silver_df</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">write</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">.</span><span class="token builtin">format</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"delta"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">mode</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"overwrite"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">option</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"overwriteSchema"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"true"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">save</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"abfss://data@mylake.dfs.core.windows.net/silver/sales/"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">print</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string-interpolation string" style="color:#e3116c">f"Silver layer written: </span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">{</span><span class="token string-interpolation interpolation">silver_df</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">.</span><span class="token string-interpolation interpolation">count</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">(</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">)</span><span class="token string-interpolation interpolation punctuation" style="color:#393A34">}</span><span class="token string-interpolation string" style="color:#e3116c"> records"</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
<p>The deduplication step alone would have prevented our $180,000 revenue discrepancy. The raw Bronze data had duplicate order records from a retry bug in the API client. Silver catches them. Gold never sees them.</p>
<p>One big win beyond fixing bugs: multiple teams can now pull from the <em>same</em> Silver datasets instead of each building their own version of the truth. That alone eliminates an enormous amount of duplicate work and conflicting numbers.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-silver-looks-like-in-storage">What Silver looks like in storage<a href="https://www.recodehive.com/blog/medallion-architecture#what-silver-looks-like-in-storage" class="hash-link" aria-label="Direct link to What Silver looks like in storage" title="Direct link to What Silver looks like in storage" translate="no">​</a></h3>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">adls-gen2/</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  └── silver/</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        └── sales/</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">              ├── _delta_log/     ← Delta Lake transaction log</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">              ├── part-00000.parquet</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">              └── part-00001.parquet</span><br></div></code></pre></div></div>
<p>Unlike Bronze (raw files), Silver is a proper <strong>Delta table</strong> with ACID guarantees, time travel, and schema enforcement.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-gold-built-for-business-not-engineers">🥇 Gold: Built for Business, Not Engineers<a href="https://www.recodehive.com/blog/medallion-architecture#-gold-built-for-business-not-engineers" class="hash-link" aria-label="Direct link to 🥇 Gold: Built for Business, Not Engineers" title="Direct link to 🥇 Gold: Built for Business, Not Engineers" translate="no">​</a></h2>
<p>Gold is what your stakeholders actually see.</p>
<p>This layer takes clean Silver data and shapes it for specific use cases, sales dashboards, executive reports, product metrics. It's aggregated, optimized, and structured for fast queries.</p>
<p>You're not building for flexibility here. You're building for <strong>clarity</strong>.</p>
<blockquote>
<p><strong>Think of it as the finished product on the shelf</strong>, packaged, polished, and ready to use.</p>
</blockquote>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-gold-looks-like-in-practice">What Gold looks like in practice<a href="https://www.recodehive.com/blog/medallion-architecture#what-gold-looks-like-in-practice" class="hash-link" aria-label="Direct link to What Gold looks like in practice" title="Direct link to What Gold looks like in practice" translate="no">​</a></h3>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> pyspark</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">sql</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">functions </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> </span><span class="token builtin">sum</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> count</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> avg</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> col</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># Read from Silver</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">silver_df </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> spark</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">read</span><span class="token punctuation" style="color:#393A34">.</span><span class="token builtin">format</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"delta"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">load</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">"abfss://data@mylake.dfs.core.windows.net/silver/sales/"</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># Build Gold: monthly revenue by region</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">gold_df </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    silver_df</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">.</span><span class="token builtin">filter</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">col</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"is_valid"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">==</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">True</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">groupBy</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"region"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"order_date"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">agg</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        count</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"order_id"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">alias</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"total_orders"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token builtin">sum</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"amount"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">alias</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"total_revenue"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        avg</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"amount"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">alias</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"avg_order_value"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">orderBy</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"order_date"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"region"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># Write to Gold</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    gold_df</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">write</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">.</span><span class="token builtin">format</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"delta"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">mode</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"overwrite"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">save</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"abfss://data@mylake.dfs.core.windows.net/gold/revenue_by_region/"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
<p>The Gold table is what Power BI connects to. Pre-aggregated, fast, shaped exactly for the business question it answers.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-gold-looks-like-in-storage">What Gold looks like in storage<a href="https://www.recodehive.com/blog/medallion-architecture#what-gold-looks-like-in-storage" class="hash-link" aria-label="Direct link to What Gold looks like in storage" title="Direct link to What Gold looks like in storage" translate="no">​</a></h3>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">adls-gen2/</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  └── gold/</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        ├── revenue_by_region/      ← one table per business use case</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        ├── customer_summary/</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        └── product_performance/</span><br></div></code></pre></div></div>
<p>Notice: Gold is not one big table. Each Gold table answers one specific business question.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-this-actually-matters">Why This Actually Matters<a href="https://www.recodehive.com/blog/medallion-architecture#why-this-actually-matters" class="hash-link" aria-label="Direct link to Why This Actually Matters" title="Direct link to Why This Actually Matters" translate="no">​</a></h2>
<p>Here's what Medallion Architecture would have changed about our Tuesday afternoon incident:</p>
<table><thead><tr><th>Problem we had</th><th>Without Medallion</th><th>With Medallion</th></tr></thead><tbody><tr><td>Duplicate orders from API retry bug</td><td>Silently corrupted revenue reports</td><td>Caught and removed in Silver</td></tr><tr><td>Couldn't find where numbers went wrong</td><td>Four hours of undocumented rabbit holes</td><td>Isolated to exactly one layer</td></tr><tr><td>Re-ingesting data after the fix</td><td>Re-called the API (data had since changed)</td><td>Replayed from Bronze (data preserved)</td></tr><tr><td>Finance and analytics had different numbers</td><td>Both teams built their own transforms</td><td>Both teams use the same Silver table</td></tr><tr><td>Schema changed upstream, broke pipeline</td><td>Broke everything simultaneously</td><td>Bronze absorbed it, Silver flagged it</td></tr></tbody></table>
<p>The pattern isn't just about organization, it's about <strong>trust</strong>. When your team knows exactly where data came from and how it was transformed at each step, confidence in analytics goes up. Decisions improve. Four-hour debugging sessions stop happening.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="its-not-always-perfect">It's Not Always Perfect<a href="https://www.recodehive.com/blog/medallion-architecture#its-not-always-perfect" class="hash-link" aria-label="Direct link to It's Not Always Perfect" title="Direct link to It's Not Always Perfect" translate="no">​</a></h2>
<p>Let's be honest: Medallion Architecture does add complexity.</p>
<p>More layers = more storage, more pipelines, more things to maintain. For a small team doing simple reporting, it might genuinely be overkill.</p>
<p><strong>It's a great fit when:</strong></p>
<ul>
<li class="">You have multiple data sources with varying quality</li>
<li class="">Multiple teams consume the same data</li>
<li class="">Data quality is non-negotiable</li>
<li class="">Pipelines need to be recoverable and replayable</li>
<li class="">You need to audit exactly where a number came from</li>
</ul>
<p><strong>It's probably overkill when:</strong></p>
<ul>
<li class="">You have one small, clean dataset</li>
<li class="">It's a one-time analysis</li>
<li class="">You're just building a proof of concept</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="beyond-the-three-layers">Beyond the Three Layers<a href="https://www.recodehive.com/blog/medallion-architecture#beyond-the-three-layers" class="hash-link" aria-label="Direct link to Beyond the Three Layers" title="Direct link to Beyond the Three Layers" translate="no">​</a></h2>
<p>In practice, teams often extend the model:</p>
<ul>
<li class=""><strong>Landing / Staging layer</strong> — temporary storage before Bronze, used when data needs to be decrypted, unzipped, or format-converted before it can be stored</li>
<li class=""><strong>Feature layer</strong> — prepared datasets for ML model training, maintained by data science teams on top of Silver</li>
<li class=""><strong>Semantic layer</strong> — business-friendly models sitting between Gold and end users for self-serve BI</li>
</ul>
<p><img decoding="async" loading="lazy" alt="Extended Medallion Architecture with optional Landing, Feature, and Semantic layers" src="https://www.recodehive.com/assets/images/medallion-extended-layers-cbab23c52bb8e9e2e231f12013dcc57b.png" width="1672" height="941" class="img_ev3q"></p>
<p>The three-tier model is a starting point, not a ceiling. The right number of layers is whatever your team actually needs.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-full-folder-structure">The Full Folder Structure<a href="https://www.recodehive.com/blog/medallion-architecture#the-full-folder-structure" class="hash-link" aria-label="Direct link to The Full Folder Structure" title="Direct link to The Full Folder Structure" translate="no">​</a></h2>
<p>Here's what a complete Medallion Architecture implementation looks like in ADLS Gen2:</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">adls-gen2/</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  └── data/</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        ├── bronze/</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        │     ├── sales/2024/01/raw_orders_20240115.parquet</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        │     └── customers/2024/01/raw_customers_20240115.json</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        │</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        ├── silver/</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        │     ├── sales/</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        │     │     ├── _delta_log/</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        │     │     └── part-00000.parquet</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        │     └── customers/</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        │           ├── _delta_log/</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        │           └── part-00000.parquet</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        │</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        └── gold/</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">              ├── revenue_by_region/</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">              ├── customer_summary/</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">              └── product_performance/</span><br></div></code></pre></div></div>
<p>This is the exact structure we adopted after the revenue incident. Bronze preserved everything. Silver caught the duplicates. Gold gave the business team numbers they could trust.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-key-lessons">The Key Lessons<a href="https://www.recodehive.com/blog/medallion-architecture#the-key-lessons" class="hash-link" aria-label="Direct link to The Key Lessons" title="Direct link to The Key Lessons" translate="no">​</a></h2>
<p><strong>1. Raw data and report data should never live in the same layer.</strong> The moment raw data flows directly into a dashboard, you've lost the ability to catch errors before they reach stakeholders.</p>
<p><strong>2. Bronze is not a dumping ground, it's a source of truth.</strong> Its value is that it's complete and immutable. The messiness is the point.</p>
<p><strong>3. Most data engineering work happens in Silver.</strong> Deduplication, validation, standardization this is where pipeline quality is actually built.</p>
<p><strong>4. Gold tables are specific, not flexible.</strong> One table per business use case. Pre-aggregated, fast, and shaped exactly for the question it answers.</p>
<p><strong>5. When something breaks, you replay from Bronze.</strong> You never re-ingest from source. Bronze is your checkpoint.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="references--further-reading">References &amp; Further Reading<a href="https://www.recodehive.com/blog/medallion-architecture#references--further-reading" class="hash-link" aria-label="Direct link to References &amp; Further Reading" title="Direct link to References &amp; Further Reading" translate="no">​</a></h2>
<ul>
<li class=""><a href="https://www.databricks.com/glossary/medallion-architecture" target="_blank" rel="noopener noreferrer" class="">Databricks - Medallion Architecture</a></li>
<li class=""><a href="https://learn.microsoft.com/en-us/azure/databricks/lakehouse/medallion" target="_blank" rel="noopener noreferrer" class="">Microsoft Learn - Medallion Lakehouse Architecture</a></li>
<li class=""><a href="https://docs.delta.io/" target="_blank" rel="noopener noreferrer" class="">Delta Lake - What is Delta Lake?</a></li>
<li class=""><a href="https://www.recodehive.com/blog/lakehouse-vs-warehouse" target="_blank" rel="noopener noreferrer" class="">RecodeHive - Lakehouse vs Data Warehouse</a></li>
<li class=""><a href="https://www.recodehive.com/blog/microsoft-fabric-explained" target="_blank" rel="noopener noreferrer" class="">RecodeHive - Microsoft Fabric: One Platform, One Lake</a></li>
<li class=""><a href="https://www.recodehive.com/blog/azure-storage-options" target="_blank" rel="noopener noreferrer" class="">RecodeHive - Azure Storage &amp; ADLS Gen2</a></li>
<li class=""><a href="https://oneuptime.com/blog/post/2026-02-17-how-to-build-a-data-lakehouse-architecture-on-gcp-using-cloud-storage-dataproc-and-bigquery/view" target="_blank" rel="noopener noreferrer" class="">OneUptime - Build a Data Lakehouse on GCP</a></li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="about-the-author">About the Author<a href="https://www.recodehive.com/blog/medallion-architecture#about-the-author" class="hash-link" aria-label="Direct link to About the Author" title="Direct link to About the Author" translate="no">​</a></h2>
<p>I'm <strong>Aditya Singh Rathore</strong>, a Data Engineer passionate about building modern, scalable data platforms. I write about data engineering, Azure, and real-world pipeline design on <a href="https://www.recodehive.com/" target="_blank" rel="noopener noreferrer" class="">RecodeHive</a> — turning hard-won lessons into content anyone can learn from.</p>
<p>🔗 <a href="https://www.linkedin.com/in/aditya-singh-rathore0017/" target="_blank" rel="noopener noreferrer" class="">LinkedIn</a> | <a href="https://github.com/Adez017" target="_blank" rel="noopener noreferrer" class="">GitHub</a></p>
<p>📩 Had a similar pipeline disaster? Drop it in the comments I'd love to hear how you solved it.</p>
<div></div>]]></content:encoded>
            <author>rathoreadityasingh30@gmail.com (Aditya Singh Rathore)</author>
            <category>medallion-architecture</category>
            <category>data-engineering</category>
            <category>bronze-silver-gold</category>
            <category>data-pipeline</category>
            <category>delta-lake</category>
            <category>spark</category>
            <category>databricks</category>
            <category>microsoft-fabric</category>
            <category>data-quality</category>
        </item>
        <item>
            <title><![CDATA[Azure Data Factory Pipeline: Build Your First ETL in 10 Minutes]]></title>
            <link>https://www.recodehive.com/blog/ETL-pipeline-tutorial</link>
            <guid>https://www.recodehive.com/blog/ETL-pipeline-tutorial</guid>
            <pubDate>Wed, 06 May 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Azure Data Factory is Microsoft's cloud-native ETL service — a visual, no-code platform for moving and transforming data at scale. This step-by-step guide walks you through building your first real pipeline in under 10 minutes, explaining every concept along the way.]]></description>
            <content:encoded><![CDATA[<p>The first time someone asked me to "build an ETL pipeline," I nodded confidently and then quietly searched "what is ETL" on my second monitor.</p>
<p>Extract. Transform. Load.</p>
<p>Three words that describe something every data team does dozens of times a day — pulling data from somewhere, doing something to it, and putting it somewhere more useful. Simple idea. Historically, painful to implement.</p>
<p>You'd write Python scripts that broke when the source schema changed. You'd schedule them with cron jobs that nobody monitored. You'd debug failures at 2am by reading raw logs.</p>
<p><strong>Azure Data Factory</strong> (ADF) exists to replace all of that with a visual, managed, scalable pipeline service, one where you can build a working ETL in minutes, not days, and monitor it from a dashboard instead of a terminal.</p>
<p>This guide walks you through everything, the concepts, the components, and a complete step-by-step pipeline you can build right now.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-is-azure-data-factory">What Is Azure Data Factory?<a href="https://www.recodehive.com/blog/ETL-pipeline-tutorial#what-is-azure-data-factory" class="hash-link" aria-label="Direct link to What Is Azure Data Factory?" title="Direct link to What Is Azure Data Factory?" translate="no">​</a></h2>
<p>Azure Data Factory is Microsoft's cloud-native ETL and data integration service. It lets you build <strong>data pipelines</strong>, workflows that move data from one place to another, transform it along the way, and load it into a destination where it's actually useful.</p>
<p>The key word is <em>visual</em>. ADF gives you a drag-and-drop canvas where you connect activities, configure sources and destinations, and build complex workflows without writing infrastructure code.</p>
<p>Under the hood, it handles:</p>
<ul>
<li class="">Connecting to 90+ data sources (databases, APIs, files, SaaS apps)</li>
<li class="">Moving data at scale using managed compute</li>
<li class="">Scheduling and triggering pipeline runs</li>
<li class="">Monitoring, alerting, and retry logic</li>
</ul>
<p>Think of it as the <strong>orchestration layer</strong> of your Azure data stack, the thing that decides what data moves where, when, and how.</p>
<p><img decoding="async" loading="lazy" alt="Azure Data Factory pipeline canvas showing a Copy Activity connected from Blob Storage source to ADLS Gen2 sink, with linked services and datasets illustrated" src="https://www.recodehive.com/assets/images/adf-pipeline-overview-8047a68f55cc56718249c27c3d20c7d6.png" width="960" height="732" class="img_ev3q"></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-4-concepts-you-need-to-know-first">The 4 Concepts You Need to Know First<a href="https://www.recodehive.com/blog/ETL-pipeline-tutorial#the-4-concepts-you-need-to-know-first" class="hash-link" aria-label="Direct link to The 4 Concepts You Need to Know First" title="Direct link to The 4 Concepts You Need to Know First" translate="no">​</a></h2>
<p>Before you touch the UI, these four concepts need to click. Everything in ADF is built on them.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="1-linked-service-the-connection">1. Linked Service: The Connection<a href="https://www.recodehive.com/blog/ETL-pipeline-tutorial#1-linked-service-the-connection" class="hash-link" aria-label="Direct link to 1. Linked Service: The Connection" title="Direct link to 1. Linked Service: The Connection" translate="no">​</a></h3>
<p>A <strong>Linked Service</strong> is a connection string. It tells ADF how to connect to an external resource — a storage account, a database, an API.</p>
<p>Think of it as the key to a door. Before ADF can read from your Blob Storage or write to your SQL database, it needs a Linked Service that holds the credentials and connection details for that resource.</p>
<p>You create a Linked Service once, then reuse it across as many datasets and pipelines as you need.</p>
<iframe width="100%" height="400" src="https://www.youtube.com/embed/EpDkxTHAhOs" title="YouTube video player" style="border:none" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
<p><strong>Examples:</strong></p>
<ul>
<li class=""><code>AzureStorageLinkedService</code> → connects to your ADLS Gen2 account</li>
<li class=""><code>AzureSqlLinkedService</code> → connects to your Azure SQL Database</li>
<li class=""><code>RestApiLinkedService</code> → connects to an external HTTP API</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="2-dataset-the-pointer">2. Dataset: The Pointer<a href="https://www.recodehive.com/blog/ETL-pipeline-tutorial#2-dataset-the-pointer" class="hash-link" aria-label="Direct link to 2. Dataset: The Pointer" title="Direct link to 2. Dataset: The Pointer" translate="no">​</a></h3>
<p>A <strong>Dataset</strong> points to the specific data within a Linked Service.</p>
<p>If the Linked Service is the key to the building, the Dataset is the directions to a specific room inside it. It tells ADF: <em>"The data I care about is in this container, in this folder, in this file format."</em></p>
<p><strong>Examples:</strong></p>
<ul>
<li class="">A Dataset pointing to <code>bronze/sales/2024/jan/*.csv</code> in your ADLS Gen2 account</li>
<li class="">A Dataset pointing to the <code>[dbo].[orders]</code> table in your SQL database</li>
<li class="">A Dataset describing a Parquet file with a known schema</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="3-activity-the-work">3. Activity: The Work<a href="https://www.recodehive.com/blog/ETL-pipeline-tutorial#3-activity-the-work" class="hash-link" aria-label="Direct link to 3. Activity: The Work" title="Direct link to 3. Activity: The Work" translate="no">​</a></h3>
<p>An <strong>Activity</strong> is a single step of work inside a pipeline. ADF has three categories:</p>
<ul>
<li class=""><strong>Data Movement</strong> — Copy data from source to destination. The <strong>Copy Activity</strong> is the most common one you'll use.</li>
<li class=""><strong>Data Transformation</strong> — Transform data using Mapping Data Flows, Databricks notebooks, or stored procedures.</li>
<li class=""><strong>Control Flow</strong> — Logic and orchestration: If/Else conditions, ForEach loops, Wait activities, Execute Pipeline (call another pipeline).</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="4-pipeline--the-workflow">4. Pipeline — The Workflow<a href="https://www.recodehive.com/blog/ETL-pipeline-tutorial#4-pipeline--the-workflow" class="hash-link" aria-label="Direct link to 4. Pipeline — The Workflow" title="Direct link to 4. Pipeline — The Workflow" translate="no">​</a></h3>
<p>A <strong>Pipeline</strong> is a logical grouping of activities that together perform a unit of work.</p>
<p>Your pipeline might have three activities: a Copy Activity to land raw data, a Data Flow activity to clean it, and a Stored Procedure activity to update a watermark table. Together they form one repeatable workflow.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-etl-flow-in-adf-visualised">The ETL Flow in ADF: Visualised<a href="https://www.recodehive.com/blog/ETL-pipeline-tutorial#the-etl-flow-in-adf-visualised" class="hash-link" aria-label="Direct link to The ETL Flow in ADF: Visualised" title="Direct link to The ETL Flow in ADF: Visualised" translate="no">​</a></h2>
<p>Here's how all four concepts connect in a real pipeline:</p>
<p><img decoding="async" loading="lazy" alt="End-to-end ADF ETL flow showing: REST API source → Linked Service → Dataset → Copy Activity → Dataset → Linked Service → ADLS Gen2 sink. Below the flow: Trigger icon labeled &amp;quot;Scheduled: daily 2am&amp;quot;. All inside a Pipeline box." src="https://www.recodehive.com/assets/images/adf-elt-flow-5391f0d696267b8fb0bafbd3fff7ad99.png" width="1186" height="813" class="img_ev3q"></p>
<img alt="ADF pipeline diagram" width="500" height="50">
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="build-your-first-pipeline-step-by-step">Build Your First Pipeline: Step by Step<a href="https://www.recodehive.com/blog/ETL-pipeline-tutorial#build-your-first-pipeline-step-by-step" class="hash-link" aria-label="Direct link to Build Your First Pipeline: Step by Step" title="Direct link to Build Your First Pipeline: Step by Step" translate="no">​</a></h2>
<p>Let's build a real pipeline: copy a CSV file from Azure Blob Storage into ADLS Gen2, landing it in a <code>bronze/</code> folder.</p>
<p><strong>What you need before starting:</strong></p>
<ul>
<li class="">An Azure account (free trial works fine)</li>
<li class="">A Storage Account with hierarchical namespace enabled (ADLS Gen2)</li>
<li class="">A CSV file uploaded to a container called <code>source/</code></li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="step-1-create-an-azure-data-factory">Step 1: Create an Azure Data Factory<a href="https://www.recodehive.com/blog/ETL-pipeline-tutorial#step-1-create-an-azure-data-factory" class="hash-link" aria-label="Direct link to Step 1: Create an Azure Data Factory" title="Direct link to Step 1: Create an Azure Data Factory" translate="no">​</a></h3>
<ol>
<li class="">Go to the <a href="https://portal.azure.com/" target="_blank" rel="noopener noreferrer" class="">Azure Portal</a></li>
<li class="">Search for <strong>Data Factory</strong> → click <strong>Create</strong></li>
<li class="">Fill in the details:<!-- -->
<ul>
<li class="">Resource Group: your existing one or create new</li>
<li class="">Name: <code>sales-data-factory</code> (must be globally unique)</li>
<li class="">Region: same as your storage account</li>
</ul>
</li>
<li class="">Click <strong>Review + Create</strong> → <strong>Create</strong></li>
<li class="">Once deployed, click <strong>Launch Studio</strong></li>
</ol>
<p>You're now in <strong>ADF Studio</strong>, the visual authoring environment.</p>
<p><img decoding="async" loading="lazy" alt="step_1" src="https://www.recodehive.com/assets/images/step-1-2d42ff51adbbfbff732b6ca733a9b62e.png" width="958" height="873" class="img_ev3q"></p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="step-2-create-a-linked-service-for-your-storage-account">Step 2: Create a Linked Service for Your Storage Account<a href="https://www.recodehive.com/blog/ETL-pipeline-tutorial#step-2-create-a-linked-service-for-your-storage-account" class="hash-link" aria-label="Direct link to Step 2: Create a Linked Service for Your Storage Account" title="Direct link to Step 2: Create a Linked Service for Your Storage Account" translate="no">​</a></h3>
<ol>
<li class="">In ADF Studio, click <strong>Manage</strong> (toolbox icon, left sidebar)</li>
<li class="">Click <strong>Linked Services</strong> → <strong>New</strong></li>
<li class="">Search for <strong>Azure Data Lake Storage Gen2</strong> → Select → Continue</li>
<li class="">Fill in:<!-- -->
<ul>
<li class="">Name: <code>ADLSGen2LinkedService</code></li>
<li class="">Authentication: Account Key (simplest for now)</li>
<li class="">Storage Account: select yours from the dropdown</li>
</ul>
</li>
<li class="">Click <strong>Test Connection</strong> — you should see ✅ Connection successful</li>
<li class="">Click <strong>Create</strong>!</li>
</ol>
<p><img decoding="async" loading="lazy" alt="ADF Studio Linked Service creation screen showing ADLS Gen2 selected with connection test successful" src="https://www.recodehive.com/assets/images/adf-linked-service-8211855fbddd512fc01315d6e0b09d0e.png" width="777" height="875" class="img_ev3q"></p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="step-3-create-the-source-dataset">Step 3: Create the Source Dataset<a href="https://www.recodehive.com/blog/ETL-pipeline-tutorial#step-3-create-the-source-dataset" class="hash-link" aria-label="Direct link to Step 3: Create the Source Dataset" title="Direct link to Step 3: Create the Source Dataset" translate="no">​</a></h3>
<p>This dataset points to the CSV file in your <code>source/</code> container.</p>
<ol>
<li class="">Click <strong>Author</strong> (pencil icon, left sidebar)</li>
<li class="">Click <strong>+</strong> → <strong>Dataset</strong></li>
<li class="">Search for <strong>Azure Data Lake Storage Gen2</strong> → Continue</li>
<li class="">Select <strong>Delimited Text</strong> (CSV format) → Continue</li>
<li class="">Fill in:<!-- -->
<ul>
<li class="">Name: <code>SourceCSVDataset</code></li>
<li class="">Linked Service: <code>ADLSGen2LinkedService</code></li>
<li class="">File path: <code>source/</code> → browse and select your CSV file</li>
<li class="">First row as header: ✅ checked</li>
</ul>
</li>
<li class="">Click <strong>OK</strong></li>
</ol>
<p><img decoding="async" loading="lazy" alt="adf_datasets" src="https://www.recodehive.com/assets/images/adf-dataset-80ced611ee690549c8a2317ec5095da2.png" width="1472" height="767" class="img_ev3q"></p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="step-4-create-the-sink-dataset">Step 4: Create the Sink Dataset<a href="https://www.recodehive.com/blog/ETL-pipeline-tutorial#step-4-create-the-sink-dataset" class="hash-link" aria-label="Direct link to Step 4: Create the Sink Dataset" title="Direct link to Step 4: Create the Sink Dataset" translate="no">​</a></h3>
<p>This dataset points to where the file should land, your <code>bronze/</code> folder.</p>
<ol>
<li class="">Click <strong>+</strong> → <strong>Dataset</strong> again</li>
<li class="">Same steps — <strong>Azure Data Lake Storage Gen2</strong> → <strong>Delimited Text</strong></li>
<li class="">Fill in:<!-- -->
<ul>
<li class="">Name: <code>BronzeCSVDataset</code></li>
<li class="">Linked Service: <code>ADLSGen2LinkedService</code></li>
<li class="">File path: <code>bronze/sales/</code> (type this manually, it doesn't need to exist yet, ADF will create it)</li>
</ul>
</li>
<li class="">Click <strong>OK</strong></li>
</ol>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="step-5-build-the-pipeline">Step 5: Build the Pipeline<a href="https://www.recodehive.com/blog/ETL-pipeline-tutorial#step-5-build-the-pipeline" class="hash-link" aria-label="Direct link to Step 5: Build the Pipeline" title="Direct link to Step 5: Build the Pipeline" translate="no">​</a></h3>
<ol>
<li class="">Click <strong>+</strong> → <strong>Pipeline</strong> → name it <code>CopySalesToBronze</code></li>
<li class="">From the <strong>Activities</strong> panel on the left, expand <strong>Move &amp; Transform</strong></li>
<li class="">Drag <strong>Copy data</strong> onto the canvas</li>
<li class="">Click the Copy Activity to open its settings:</li>
</ol>
<p><strong>Source tab:</strong></p>
<ul>
<li class="">Source dataset: <code>SourceCSVDataset</code></li>
</ul>
<p><strong>Sink tab:</strong></p>
<ul>
<li class="">Sink dataset: <code>BronzeCSVDataset</code></li>
<li class="">Copy behavior: <code>PreserveHierarchy</code></li>
</ul>
<p><strong>Mapping tab:</strong></p>
<ul>
<li class="">Click <strong>Import schemas</strong> - ADF reads your CSV headers and maps columns automatically</li>
</ul>
<ol start="5">
<li class="">Click <strong>Validate</strong> (toolbar) - you should see no errors</li>
<li class="">Click <strong>Debug</strong> - this runs the pipeline immediately without publishing</li>
</ol>
<p><img decoding="async" loading="lazy" alt="ADF pipeline canvas showing Copy Activity with Source and Sink configured, Debug button highlighted in toolbar" src="https://www.recodehive.com/assets/images/adf-pipeline-debug-38367815446634916a1c4345ac79ebe5.png" width="1255" height="877" class="img_ev3q"></p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="step-6-publish-and-add-a-trigger">Step 6: Publish and Add a Trigger<a href="https://www.recodehive.com/blog/ETL-pipeline-tutorial#step-6-publish-and-add-a-trigger" class="hash-link" aria-label="Direct link to Step 6: Publish and Add a Trigger" title="Direct link to Step 6: Publish and Add a Trigger" translate="no">​</a></h3>
<p>Once Debug runs successfully:</p>
<ol>
<li class="">Click <strong>Publish All</strong> (top toolbar) - this saves everything to ADF</li>
<li class="">Click <strong>Add trigger</strong> → <strong>New/Edit</strong></li>
<li class="">Click <strong>New</strong> → configure:<!-- -->
<ul>
<li class="">Type: <strong>Schedule</strong></li>
<li class="">Start: today's date</li>
<li class="">Recurrence: <strong>Every 1 Day</strong> at <code>02:00 AM</code></li>
</ul>
</li>
<li class="">Click <strong>OK</strong> → <strong>OK</strong></li>
<li class="">Click <strong>Publish All</strong> again</li>
</ol>
<p>Your pipeline now runs automatically every night at 2am, copying new sales data into your bronze layer.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="step-7-monitor-your-pipeline">Step 7: Monitor Your Pipeline<a href="https://www.recodehive.com/blog/ETL-pipeline-tutorial#step-7-monitor-your-pipeline" class="hash-link" aria-label="Direct link to Step 7: Monitor Your Pipeline" title="Direct link to Step 7: Monitor Your Pipeline" translate="no">​</a></h3>
<ol>
<li class="">Click <strong>Monitor</strong> (chart icon, left sidebar)</li>
<li class="">You'll see all pipeline runs - status, duration, rows copied</li>
<li class="">Click any run to see activity-level details</li>
<li class="">If something fails, click the error icon to see exactly which activity failed and why</li>
</ol>
<p><img decoding="async" loading="lazy" alt="ADF Monitor tab showing pipeline run history with status, duration, and rows copied columns" src="https://www.recodehive.com/assets/images/adf-monitor-577cdb42a742c96c8d4b4a2fdb1cccde.png" width="1921" height="880" class="img_ev3q"></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-just-happened-the-full-picture">What Just Happened: The Full Picture<a href="https://www.recodehive.com/blog/ETL-pipeline-tutorial#what-just-happened-the-full-picture" class="hash-link" aria-label="Direct link to What Just Happened: The Full Picture" title="Direct link to What Just Happened: The Full Picture" translate="no">​</a></h2>
<p>Let's step back and look at what you built:</p>
<img alt="ADF end-to-end ETL flow" width="500" height="50">
<p>This is the <strong>Extract and Load</strong> part of ETL. The file is extracted from the source container and loaded into the bronze layer, untouched, exactly as it arrived.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-comes-next-transform">What Comes Next: Transform<a href="https://www.recodehive.com/blog/ETL-pipeline-tutorial#what-comes-next-transform" class="hash-link" aria-label="Direct link to What Comes Next: Transform" title="Direct link to What Comes Next: Transform" translate="no">​</a></h2>
<p>The pipeline you built moves data. To transform it, you add one of two things after the Copy Activity:</p>
<p><strong>Option 1 — Mapping Data Flow</strong> (no-code)
A visual transformation canvas inside ADF. Drag and drop Filter, Join, Aggregate, Derived Column activities. Runs on Spark under the hood. Great for teams that don't want to write code.</p>
<p><strong>Option 2 — Databricks Notebook Activity</strong>
Call an existing Databricks notebook from your ADF pipeline. The notebook runs your Python/Spark transformation logic and writes cleaned data to the silver layer. Best for complex transformations that need code.</p>
<p>The full Medallion Architecture flow in ADF looks like this:</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">Source API / Database</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        ↓</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">Copy Activity → bronze/ (raw data, as-is)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        ↓</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">Mapping Data Flow / Databricks Notebook → silver/ (cleaned, validated)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        ↓</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">Mapping Data Flow / Databricks Notebook → gold/ (aggregated, business-ready)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        ↓</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">Power BI DirectLake → Dashboard</span><br></div></code></pre></div></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="triggers-when-does-your-pipeline-run">Triggers: When Does Your Pipeline Run?<a href="https://www.recodehive.com/blog/ETL-pipeline-tutorial#triggers-when-does-your-pipeline-run" class="hash-link" aria-label="Direct link to Triggers: When Does Your Pipeline Run?" title="Direct link to Triggers: When Does Your Pipeline Run?" translate="no">​</a></h2>
<p>ADF gives you three trigger types:</p>
<table><thead><tr><th>Trigger Type</th><th>When it fires</th><th>Use case</th></tr></thead><tbody><tr><td><strong>Schedule</strong></td><td>At a fixed time/frequency</td><td>Nightly batch loads</td></tr><tr><td><strong>Tumbling Window</strong></td><td>Fixed intervals with state</td><td>Hourly incremental loads</td></tr><tr><td><strong>Storage Event</strong></td><td>When a file arrives in storage</td><td>File-arrival driven pipelines</td></tr><tr><td><strong>Manual</strong></td><td>On demand</td><td>One-time loads, testing</td></tr></tbody></table>
<p>For production pipelines, <strong>Storage Event triggers</strong> are the most powerful, your pipeline fires automatically the moment a new file lands in your container, with no polling or scheduling lag.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="common-mistakes-beginners-make">Common Mistakes Beginners Make<a href="https://www.recodehive.com/blog/ETL-pipeline-tutorial#common-mistakes-beginners-make" class="hash-link" aria-label="Direct link to Common Mistakes Beginners Make" title="Direct link to Common Mistakes Beginners Make" translate="no">​</a></h2>
<p><strong>1. Using the same Linked Service for every environment</strong>
Create separate Linked Services for dev, staging, and production. Use ADF's <strong>parameterisation</strong> to swap them out without changing pipeline logic.</p>
<p><strong>2. Not testing with Debug before publishing</strong>
Always Debug first. Publishing without testing means failures hit production. Debug runs don't count against your trigger history.</p>
<p><strong>3. Hardcoding file paths in datasets</strong>
Parameterise your datasets so the same pipeline can process different files dynamically. One pipeline, many files, not one pipeline per file.</p>
<p><strong>4. No monitoring alerts</strong>
Set up Azure Monitor alerts for pipeline failures. You shouldn't find out a pipeline failed when someone asks why last night's data is missing.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="key-takeaways">Key Takeaways<a href="https://www.recodehive.com/blog/ETL-pipeline-tutorial#key-takeaways" class="hash-link" aria-label="Direct link to Key Takeaways" title="Direct link to Key Takeaways" translate="no">​</a></h2>
<p><strong>1. ADF is built on four concepts.</strong> Linked Services (connections), Datasets (pointers), Activities (work), Pipelines (workflows). Everything else is a variation of these four.</p>
<p><strong>2. The Copy Activity is your workhorse.</strong> It supports 90+ source/sink combinations and handles schema mapping, file format conversion, and retry logic out of the box.</p>
<p><strong>3. ADF is the orchestration layer, not the transformation layer.</strong> For heavy transformations, ADF calls Databricks or Data Flows, it doesn't do the transformation itself.</p>
<p><strong>4. Triggers make pipelines production-ready.</strong> A pipeline without a trigger is just a script you run manually. Add a trigger and it becomes infrastructure.</p>
<p><strong>5. ADF fits naturally into Medallion Architecture.</strong> Copy Activity lands data in bronze. Data Flows or Databricks jobs process silver and gold. ADF orchestrates the whole sequence.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="references--further-reading">References &amp; Further Reading<a href="https://www.recodehive.com/blog/ETL-pipeline-tutorial#references--further-reading" class="hash-link" aria-label="Direct link to References &amp; Further Reading" title="Direct link to References &amp; Further Reading" translate="no">​</a></h2>
<ul>
<li class=""><a href="https://learn.microsoft.com/en-us/azure/data-factory/introduction" target="_blank" rel="noopener noreferrer" class="">Microsoft Docs: Introduction to Azure Data Factory</a></li>
<li class=""><a href="https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-overview" target="_blank" rel="noopener noreferrer" class="">Microsoft Docs: Copy Activity in ADF</a></li>
<li class=""><a href="https://learn.microsoft.com/en-us/azure/data-factory/tutorial-copy-data-portal" target="_blank" rel="noopener noreferrer" class="">Microsoft Docs - ADF Tutorial: Copy data using Azure portal</a></li>
<li class=""><a href="https://learn.microsoft.com/en-us/azure/data-factory/concepts-data-flow-overview" target="_blank" rel="noopener noreferrer" class="">Microsoft Docs: Mapping Data Flows</a></li>
<li class=""><a href="https://learn.microsoft.com/en-us/azure/data-factory/concepts-pipeline-execution-triggers" target="_blank" rel="noopener noreferrer" class="">Microsoft Docs: Triggers in ADF</a></li>
<li class=""><a href="https://www.recodehive.com/blog/azure-storage-options" target="_blank" rel="noopener noreferrer" class="">RecodeHive - Azure Storage &amp; ADLS Gen2: Where Does Your Data Actually Live?</a></li>
<li class=""><a href="https://www.recodehive.com/blog/microsoft-fabric-explained" target="_blank" rel="noopener noreferrer" class="">RecodeHive - Microsoft Fabric: One Platform, One Lake</a></li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="about-the-author">About the Author<a href="https://www.recodehive.com/blog/ETL-pipeline-tutorial#about-the-author" class="hash-link" aria-label="Direct link to About the Author" title="Direct link to About the Author" translate="no">​</a></h2>
<p>I'm <strong>Aditya Singh Rathore</strong>, a Data Engineer passionate about building modern, scalable data platforms on Azure. I write about data engineering, cloud architecture, and real-world pipelines on <a href="https://www.recodehive.com/" target="_blank" rel="noopener noreferrer" class="">RecodeHive</a> breaking down complex concepts into things you can actually use.</p>
<p>🔗 <a href="https://www.linkedin.com/in/aditya-singh-rathore0017/" target="_blank" rel="noopener noreferrer" class="">LinkedIn</a> | <a href="https://github.com/Adez017" target="_blank" rel="noopener noreferrer" class="">GitHub</a></p>
<p>📩 Stuck on a specific ADF activity or pipeline pattern? Drop your question in the comments.</p>
<div></div>]]></content:encoded>
            <author>rathoreadityasingh30@gmail.com (Aditya Singh Rathore)</author>
            <category>azure-data-factory</category>
            <category>adf</category>
            <category>etl</category>
            <category>data-pipeline</category>
            <category>data-engineering</category>
            <category>azure</category>
            <category>blob-storage</category>
            <category>adls</category>
            <category>copy-activity</category>
            <category>linked-service</category>
            <category>dataset</category>
            <category>trigger</category>
        </item>
        <item>
            <title><![CDATA[Azure Storage & ADLS Gen2: Where Does Your Data Actually Live?]]></title>
            <link>https://www.recodehive.com/blog/azure-storage-options</link>
            <guid>https://www.recodehive.com/blog/azure-storage-options</guid>
            <pubDate>Wed, 06 May 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Every Azure data pipeline needs a place to store data. But Azure gives you four different storage types and choosing the wrong one is easier than you think. This guide explains all four, shows how they work together in a real pipeline, and goes deep on ADLS Gen2, the storage layer that powers modern Azure data engineering.]]></description>
            <content:encoded><![CDATA[<p>My first week working with Azure, I broke a pipeline before it even started.</p>
<p>I had a simple job: land some raw CSV files from a sales API into Azure so a Spark job could pick them up later. I searched "Azure storage", saw five different options staring back at me, panicked slightly, and clicked the first one that sounded sensible - <strong>Azure Table Storage</strong>.</p>
<p>Three hours later, I was staring at an error I didn't understand, in a service that was never designed for files.</p>
<p>Table Storage is a NoSQL key-value store. It stores entities and properties, not CSV files. My data had nowhere to go.</p>
<p>That confusion is more common than most Azure tutorials admit. And it happens because nobody explains the one question that actually matters before anything else:</p>
<p><strong>Where does your data actually live in Azure and why?</strong></p>
<p>This blog answers that. We'll walk through all four Azure storage types, show exactly where each one fits in a real data pipeline, and then go deep on the one that changes everything for data engineering: <strong>Azure Data Lake Storage Gen2</strong>.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="azure-has-four-storage-types-heres-the-map">Azure Has Four Storage Types. Here's the Map.<a href="https://www.recodehive.com/blog/azure-storage-options#azure-has-four-storage-types-heres-the-map" class="hash-link" aria-label="Direct link to Azure Has Four Storage Types. Here's the Map." title="Direct link to Azure Has Four Storage Types. Here's the Map." translate="no">​</a></h2>
<p>Before we build anything, let's get oriented.</p>
<p>Azure bundles all storage services under a single <strong>Storage Account</strong>, one entry point, one namespace, one billing account. Inside that account, you get access to four distinct storage services, each built for a different job.</p>
<p><img decoding="async" loading="lazy" alt="Four Azure storage types shown as rooms in a building — Blob (file cabinet), Queue (mailbox), Table (ledger), File (shared drive) with one-line descriptions of each" src="https://www.recodehive.com/assets/images/azure-storage-four-types-7259eea2fa1ef69eff0b53603aa6b00d.png" width="1672" height="941" class="img_ev3q"></p>
<p>Here's the quick map before we go deeper:</p>
<table><thead><tr><th>Storage Type</th><th>Think of it as</th><th>Stores</th><th>Used in pipelines for</th></tr></thead><tbody><tr><td><strong>Blob Storage</strong></td><td>A file cabinet</td><td>Any file CSV, JSON, Parquet, images, logs</td><td>Raw data landing zone</td></tr><tr><td><strong>Queue Storage</strong></td><td>A mailbox</td><td>Messages between services</td><td>Triggering pipeline steps</td></tr><tr><td><strong>Table Storage</strong></td><td>A ledger</td><td>Structured key-value rows</td><td>Tracking run state, metadata</td></tr><tr><td><strong>File Storage</strong></td><td>A shared network drive</td><td>Files accessed over SMB</td><td>Legacy app file shares</td></tr></tbody></table>
<p>None of these is "better." They serve different stages of the same pipeline. The mistake most beginners make, including me is picking one at random instead of understanding the job each one does.</p>
<p>Let's walk through them in the order they matter for a real data engineering workflow.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="blob-storage-the-foundation-of-everything">Blob Storage: The Foundation of Everything<a href="https://www.recodehive.com/blog/azure-storage-options#blob-storage-the-foundation-of-everything" class="hash-link" aria-label="Direct link to Blob Storage: The Foundation of Everything" title="Direct link to Blob Storage: The Foundation of Everything" translate="no">​</a></h2>
<p>When data arrives in Azure, it almost always lands in <strong>Blob Storage</strong> first.</p>
<p>Blob stands for <strong>Binary Large Object</strong> which is just a fancy way of saying "any file." CSV, JSON, Parquet, images, videos, audio, ZIP archives, raw log dumps, Blob Storage holds all of it without caring about structure or format.</p>
<p>There's no schema enforcement, no type checking. You put a file in, you get it back out. At any scale.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-three-blob-types">The three blob types<a href="https://www.recodehive.com/blog/azure-storage-options#the-three-blob-types" class="hash-link" aria-label="Direct link to The three blob types" title="Direct link to The three blob types" translate="no">​</a></h3>
<p>Depending on how your data is written, you'll use one of three blob types:</p>
<p><img decoding="async" loading="lazy" alt="blob_types" src="https://www.recodehive.com/assets/images/blob_types-4fed81d21d21a138e0066418d5165aed.png" width="1672" height="941" class="img_ev3q"></p>
<ul>
<li class=""><strong>Block Blob :</strong> Upload a file all at once. This covers 95% of data engineering use cases, your CSVs, Parquet files, JSON exports all go here.</li>
<li class=""><strong>Append Blob :</strong> Add data continuously without modifying what's already there. Perfect for log files that grow over time.</li>
<li class=""><strong>Page Blob :</strong> Optimised for random read/write operations. Used mainly for VM disks. You'll rarely touch this directly.</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="access-tiers-storage-that-adjusts-to-how-often-you-actually-need-the-data">Access tiers: storage that adjusts to how often you actually need the data<a href="https://www.recodehive.com/blog/azure-storage-options#access-tiers-storage-that-adjusts-to-how-often-you-actually-need-the-data" class="hash-link" aria-label="Direct link to Access tiers: storage that adjusts to how often you actually need the data" title="Direct link to Access tiers: storage that adjusts to how often you actually need the data" translate="no">​</a></h3>
<p>One of Blob Storage's most underrated features is <strong>access tiering</strong>:</p>
<ul>
<li class=""><strong>Hot :</strong> Data you access daily. Higher storage cost, lowest read cost.</li>
<li class=""><strong>Cool :</strong> Data you access occasionally. Cheaper to store, slightly more to read. 30-day minimum.</li>
<li class=""><strong>Archive :</strong> Data you almost never access. Extremely cheap to store, but takes hours to retrieve. Think old compliance records.</li>
</ul>
<p>You can set <strong>lifecycle policies</strong> to move data automatically between tiers as it ages. Last month's raw files move from hot to cool. Last year's move to archive. You save money without touching anything manually.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="where-blob-storage-fits-in-a-pipeline">Where Blob Storage fits in a pipeline<a href="https://www.recodehive.com/blog/azure-storage-options#where-blob-storage-fits-in-a-pipeline" class="hash-link" aria-label="Direct link to Where Blob Storage fits in a pipeline" title="Direct link to Where Blob Storage fits in a pipeline" translate="no">​</a></h3>
<p>In Medallion Architecture, Blob Storage is the natural home for the <strong>Bronze layer</strong>, the raw, unprocessed data exactly as it arrived from source systems. Nothing is cleaned. Nothing is validated. It just lands and waits.</p>
<p>But here's where things get interesting.</p>
<p>Plain Blob Storage works perfectly for general file storage. But for big data analytics pipelines, the kind where you're processing millions of files, running Spark jobs, and building Bronze/Silver/Gold layers, it has a critical limitation that most tutorials don't mention until you've already hit it.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-problem-with-plain-blob-storage-at-scale">The Problem with Plain Blob Storage at Scale<a href="https://www.recodehive.com/blog/azure-storage-options#the-problem-with-plain-blob-storage-at-scale" class="hash-link" aria-label="Direct link to The Problem with Plain Blob Storage at Scale" title="Direct link to The Problem with Plain Blob Storage at Scale" translate="no">​</a></h2>
<p>Here's something I found out the hard way six months into working with Azure pipelines.</p>
<p>I had a container full of raw sales data — about 40,000 Parquet files organised under a path that looked like <code>raw/2024/</code>. My team decided to rename it to <code>bronze/2024/</code> to match our Medallion Architecture convention. Simple enough, right?</p>
<p>It took <strong>47 minutes</strong>.</p>
<p>Not because Azure was slow. Because what looked like a folder called <code>raw/</code> was never actually a folder. In plain Blob Storage, everything lives at the same flat level, the slashes in a path like
<code>raw/2024/jan/file.parquet</code> are just characters in a key name, the same way a filename on your desktop could technically be called <code>raw-2024-jan-file.parquet</code> with dashes instead.</p>
<p>There is no directory underneath. So renaming means Azure copies each file to the new key name and deletes the old one,one file at a time, 40,000 times in a row.</p>
<p>At big data scale where you're managing millions of files across Bronze, Silver, and Gold layers that's not a minor inconvenience. It's a pipeline blocker.</p>
<p>This is the exact problem <strong>ADLS Gen2</strong> was built to fix.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="adls-gen2-blob-storage-evolved">ADLS Gen2: Blob Storage, Evolved<a href="https://www.recodehive.com/blog/azure-storage-options#adls-gen2-blob-storage-evolved" class="hash-link" aria-label="Direct link to ADLS Gen2: Blob Storage, Evolved" title="Direct link to ADLS Gen2: Blob Storage, Evolved" translate="no">​</a></h2>
<p><strong>Azure Data Lake Storage Gen2 (ADLS Gen2)</strong> is not a separate service. It's Blob Storage with one critical feature enabled: the <strong>Hierarchical Namespace</strong>.</p>
<p>With hierarchical namespace turned on, folders become real. A directory with ten million files inside it can be renamed or deleted in a <strong>single atomic operation</strong>, instant, regardless of how many files it contains.</p>
<p>That one change makes ADLS Gen2 fast enough for serious analytics workloads. It's the storage layer that Databricks, Synapse, Azure Data Factory, and Microsoft Fabric are all built to work with.</p>
<p><img decoding="async" loading="lazy" alt="Side-by-side comparison of plain Blob Storage (flat key names, fake folders) vs ADLS Gen2 (real directory tree with Bronze/Silver/Gold layers). Rename operation shown on both sides — slow/sequential on left, instant/atomic on right." src="https://www.recodehive.com/assets/images/blob-vs-adls-comparison-1c14b299fa4e9216f86b6383977ff88e.png" width="1536" height="1024" class="img_ev3q"></p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-full-adls-gen2-structure">The full ADLS Gen2 structure<a href="https://www.recodehive.com/blog/azure-storage-options#the-full-adls-gen2-structure" class="hash-link" aria-label="Direct link to The full ADLS Gen2 structure" title="Direct link to The full ADLS Gen2 structure" translate="no">​</a></h3>
<p>ADLS Gen2 organises data in three real levels:</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">Storage Account</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    └── Container (called a File System in ADLS)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            └── Directories (real, nested folders)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                    └── Files (your actual data)</span><br></div></code></pre></div></div>
<p>In practice, for a Medallion Architecture pipeline:</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">my-datalake/</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    └── data/</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            ├── bronze/</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            │     └── sales/</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            │           └── 2024/jan/raw_orders.parquet</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            ├── silver/</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            │     └── sales/</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            │           └── 2024/jan/cleaned_orders.parquet</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">            └── gold/</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                  └── sales/</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                        └── 2024/jan/monthly_revenue.parquet</span><br></div></code></pre></div></div>
<p>Bronze, Silver, Gold are real directories. Spark jobs move data between them. ADF pipelines write to them. Power BI reads from them. The Medallion pattern isn't an abstract concept it's a folder structure in ADLS Gen2 with transformation logic connecting the layers.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-abfs-driver-why-this-matters-for-spark">The ABFS driver: why this matters for Spark<a href="https://www.recodehive.com/blog/azure-storage-options#the-abfs-driver-why-this-matters-for-spark" class="hash-link" aria-label="Direct link to The ABFS driver: why this matters for Spark" title="Direct link to The ABFS driver: why this matters for Spark" translate="no">​</a></h3>
<p>When Spark, Databricks, Synapse, or Fabric connect to ADLS Gen2, they use the <strong>Azure Blob File System (ABFS) driver</strong>, accessed via the <code>abfss://</code> protocol.</p>
<p>This driver was purpose-built for analytics workloads. It's significantly faster than the old WASB driver for directory-heavy operations, and it's the reason tools like Databricks can list, read, and write millions of files in ADLS Gen2 efficiently.</p>
<p>Every time you see <code>abfss://container@storageaccount.dfs.core.windows.net/</code> in a notebook or pipeline config, that's ADLS Gen2 being accessed via the ABFS driver.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="fine-grained-access-control-with-posix-acls">Fine-grained access control with POSIX ACLs<a href="https://www.recodehive.com/blog/azure-storage-options#fine-grained-access-control-with-posix-acls" class="hash-link" aria-label="Direct link to Fine-grained access control with POSIX ACLs" title="Direct link to Fine-grained access control with POSIX ACLs" translate="no">​</a></h3>
<p>Regular Blob Storage gives you Role-Based Access Control (RBAC) at the container level. ADLS Gen2 goes further with <a href="https://www.komprise.com/glossary_terms/posix-acls/" target="_blank" rel="noopener noreferrer" class=""><strong>POSIX-style Access Control Lists (ACLs)</strong></a>, the same permission model used in Linux file systems.</p>
<p>This means you can grant a data science team read access to only the <code>silver/</code> directory, without exposing <code>bronze/</code> (raw, potentially sensitive data) or <code>gold/</code> (business metrics). Fine-grained, at the folder and file level.</p>
<p>For regulated industries - finance, healthcare, government, this isn't a nice-to-have. It's a requirement.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="storage-tiers-work-at-directory-level">Storage tiers work at directory level<a href="https://www.recodehive.com/blog/azure-storage-options#storage-tiers-work-at-directory-level" class="hash-link" aria-label="Direct link to Storage tiers work at directory level" title="Direct link to Storage tiers work at directory level" translate="no">​</a></h3>
<p>Just like Blob Storage, ADLS Gen2 supports Hot, Cool, and Archive tiers. But now you can apply lifecycle policies at the <strong>directory level</strong> automatically archiving <code>bronze/2023/</code> partitions when they're more than a year old, while keeping <code>bronze/2024/</code> hot for active pipeline use.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="adls-gen2-is-what-onelake-is-built-on">ADLS Gen2 is what OneLake is built on<a href="https://www.recodehive.com/blog/azure-storage-options#adls-gen2-is-what-onelake-is-built-on" class="hash-link" aria-label="Direct link to ADLS Gen2 is what OneLake is built on" title="Direct link to ADLS Gen2 is what OneLake is built on" translate="no">​</a></h3>
<p>If you've read about <a href="https://www.recodehive.com/blog/microsoft-fabric-explained" target="_blank" rel="noopener noreferrer" class="">Microsoft Fabric</a>, you know that OneLake is Fabric's unified data lake, the single storage layer that every Fabric workload reads from and writes to.</p>
<p>OneLake is fundamentally ADLS Gen2 with a unified namespace across your entire Fabric workspace. Understanding ADLS Gen2 means you understand the storage engine that powers Fabric, Synapse, Databricks on Azure, and every serious Azure data platform.</p>
<table><thead><tr><th>Azure Service</th><th>How it uses ADLS Gen2</th></tr></thead><tbody><tr><td><strong>Azure Data Factory</strong></td><td>Reads source files, writes pipeline outputs</td></tr><tr><td><strong>Azure Databricks</strong></td><td>Reads/writes Delta tables via ABFS driver</td></tr><tr><td><strong>Azure Synapse Analytics</strong></td><td>Queries files directly with SQL serverless</td></tr><tr><td><strong>Microsoft Fabric / OneLake</strong></td><td>OneLake IS ADLS Gen2 unified namespace</td></tr><tr><td><strong>Azure Machine Learning</strong></td><td>Stores training datasets and model artifacts</td></tr><tr><td><strong>Power BI</strong></td><td>DirectLake mode reads Delta files from ADLS Gen2</td></tr></tbody></table>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-supporting-cast-queue-and-table-storage">The Supporting Cast: Queue and Table Storage<a href="https://www.recodehive.com/blog/azure-storage-options#the-supporting-cast-queue-and-table-storage" class="hash-link" aria-label="Direct link to The Supporting Cast: Queue and Table Storage" title="Direct link to The Supporting Cast: Queue and Table Storage" translate="no">​</a></h2>
<p>ADLS Gen2 stores your data. But a pipeline isn't just storage, it's coordination, state management, and event triggering. That's where Queue Storage and Table Storage come in.</p>
<p>They're not glamorous. But remove them from a production pipeline and things fall apart quickly.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="queue-storage-the-pipeline-trigger">Queue Storage: The Pipeline Trigger<a href="https://www.recodehive.com/blog/azure-storage-options#queue-storage-the-pipeline-trigger" class="hash-link" aria-label="Direct link to Queue Storage: The Pipeline Trigger" title="Direct link to Queue Storage: The Pipeline Trigger" translate="no">​</a></h3>
<p>Queue Storage stores <strong>messages</strong>, small packets of information passed between services asynchronously.</p>
<p><img decoding="async" loading="lazy" alt="queue_storage" src="https://www.recodehive.com/assets/images/queue_storage-a678ab069e4d9fd952e33fde26cfcd2f.png" width="1672" height="941" class="img_ev3q"></p>
<p>In a data pipeline context, Queue Storage is typically used as a <strong>trigger mechanism</strong>. When a new file lands in ADLS Gen2, Azure Blob Storage can emit an event that drops a message into a Queue. Azure Data Factory (or an Azure Function) listens to that Queue and kicks off the pipeline automatically.</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">New file lands in ADLS Gen2 bronze/</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    → Event triggers a Queue message: "new file: sales_2024_jan.parquet"</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    → ADF pipeline picks up the message</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    → Pipeline runs transformation</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    → Cleaned data written to silver/</span><br></div></code></pre></div></div>
<p>Without Queue Storage, you'd either poll for new files on a schedule (wasteful) or trigger pipelines manually (not scalable).</p>
<p><strong>Key facts:</strong></p>
<ul>
<li class="">Messages up to <strong>64 KB</strong> in size</li>
<li class="">Queue holds up to <strong>200 TB</strong> of messages</li>
<li class="">Messages expire after <strong>7 days</strong> if unconsumed</li>
<li class="">Built-in retry logic if a consumer fails, the message reappears for another attempt</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="table-storage-the-pipeline-memory">Table Storage: The Pipeline Memory<a href="https://www.recodehive.com/blog/azure-storage-options#table-storage-the-pipeline-memory" class="hash-link" aria-label="Direct link to Table Storage: The Pipeline Memory" title="Direct link to Table Storage: The Pipeline Memory" translate="no">​</a></h3>
<p>Table Storage is Azure's <strong>NoSQL key-value store</strong>, schemaless rows of properties, queried by partition and row key.</p>
<p>In data pipelines, Table Storage earns its place as the <strong>watermark store</strong>, the place that remembers where a pipeline left off.</p>
<p>Imagine your ADF pipeline runs every night and ingests new rows from a source database. It can't re-read everything from day one every night. Instead, it records the <code>last_run_timestamp</code> in a Table Storage entity:</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">PartitionKey: "sales_pipeline"</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">RowKey:       "last_run"</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">Timestamp:    "2024-01-15T02:00:00Z"</span><br></div></code></pre></div></div>
<p>Next run, the pipeline reads this value, queries only rows updated since then, and updates the watermark when done. This is called <strong>incremental ingestion</strong> and Table Storage is the simplest, cheapest place to track it.</p>
<p><strong>Other pipeline uses for Table Storage:</strong></p>
<ul>
<li class="">Pipeline run metadata (status, row counts, duration)</li>
<li class="">Configuration values shared across pipeline activities</li>
<li class="">Simple lookup tables for reference data enrichment</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="file-storage-a-quick-note">File Storage: A Quick Note<a href="https://www.recodehive.com/blog/azure-storage-options#file-storage-a-quick-note" class="hash-link" aria-label="Direct link to File Storage: A Quick Note" title="Direct link to File Storage: A Quick Note" translate="no">​</a></h2>
<p>Azure File Storage provides a <strong>managed SMB file share</strong> in the cloud, the kind you mount as a network drive in Windows (<code>\\server\share</code>).</p>
<p>For data engineering pipelines, you'll rarely reach for File Storage. It's primarily useful for <strong>lift-and-shift migrations</strong>, moving on-premises applications to Azure when those applications expect to read from a network file share and you don't want to refactor them.</p>
<p>If you're building a new pipeline from scratch, ADLS Gen2 is almost always the right choice over File Storage for analytics workloads.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="adls-gen2-vs-plain-blob-storage--when-to-use-which">ADLS Gen2 vs Plain Blob Storage — When to Use Which<a href="https://www.recodehive.com/blog/azure-storage-options#adls-gen2-vs-plain-blob-storage--when-to-use-which" class="hash-link" aria-label="Direct link to ADLS Gen2 vs Plain Blob Storage — When to Use Which" title="Direct link to ADLS Gen2 vs Plain Blob Storage — When to Use Which" translate="no">​</a></h2>
<table><thead><tr><th>Scenario</th><th>Use</th></tr></thead><tbody><tr><td>Raw file landing zone for a big data pipeline</td><td><strong>ADLS Gen2</strong></td></tr><tr><td>Serving images or videos to a web application</td><td><strong>Blob Storage</strong></td></tr><tr><td>VM disk backups or snapshots</td><td><strong>Blob Storage</strong></td></tr><tr><td>Spark / Databricks / Synapse analytics workloads</td><td><strong>ADLS Gen2</strong></td></tr><tr><td>Bronze / Silver / Gold Medallion layers</td><td><strong>ADLS Gen2</strong></td></tr><tr><td>Simple static file hosting</td><td><strong>Blob Storage</strong></td></tr><tr><td>ML training datasets and model artifacts</td><td><strong>ADLS Gen2</strong></td></tr><tr><td>Microsoft Fabric / OneLake backend</td><td><strong>ADLS Gen2</strong></td></tr></tbody></table>
<p>The pricing is identical. The difference is entirely in the <strong>hierarchical namespace</strong> and the performance characteristics it unlocks for analytics.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-full-picture-one-pipeline-all-four-storage-types">The Full Picture: One Pipeline, All Four Storage Types<a href="https://www.recodehive.com/blog/azure-storage-options#the-full-picture-one-pipeline-all-four-storage-types" class="hash-link" aria-label="Direct link to The Full Picture: One Pipeline, All Four Storage Types" title="Direct link to The Full Picture: One Pipeline, All Four Storage Types" translate="no">​</a></h2>
<p>Here's how everything we've covered fits into a single, real data engineering pipeline — the kind you'd actually build in Azure:</p>
<p><img decoding="async" loading="lazy" alt="End-to-end Azure data pipeline showing all four storage types in their roles: ADLS Gen2 as Bronze/Silver/Gold layers, Queue Storage as event trigger, Table Storage as watermark store, and the full flow from API through ADF, Databricks, to Power BI" src="https://www.recodehive.com/assets/images/azure-storage-full-pipeline-5f1bb2b1700fa9f4143fdba24e171f19.png" width="1672" height="941" class="img_ev3q"></p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">REST API (sales data source)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        ↓</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">Azure Data Factory (orchestration)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        ↓ writes raw Parquet</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">ADLS Gen2 — bronze/sales/2024/</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        ↓</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">Azure Databricks (Spark: clean, deduplicate, validate)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        ↓ writes Delta tables</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">ADLS Gen2 — silver/sales/2024/</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        ↓</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">Azure Databricks (Spark: aggregate, calculate metrics)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        ↓ writes business-ready Delta tables</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">ADLS Gen2 — gold/sales/2024/</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        ↓</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">Power BI (DirectLake mode — no import, always current)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        ↓</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">Business dashboard</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">Supporting roles:</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">├── Queue Storage → ADF pipeline triggered by file arrival event</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">└── Table Storage → watermark ("last ingested: 2024-01-15T02:00:00Z")</span><br></div></code></pre></div></div>
<p>Every storage type has one job. None of them overlap. And ADLS Gen2 is the spine the whole thing runs on.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-decision-guide-one-question-at-a-time">The Decision Guide: One Question at a Time<a href="https://www.recodehive.com/blog/azure-storage-options#the-decision-guide-one-question-at-a-time" class="hash-link" aria-label="Direct link to The Decision Guide: One Question at a Time" title="Direct link to The Decision Guide: One Question at a Time" translate="no">​</a></h2>
<p>When you're building a pipeline and need to decide where something lives, ask these questions in order:</p>
<p><strong>Is it a file that a Spark job or analytics tool needs to read?</strong>
→ ADLS Gen2</p>
<p><strong>Is it a file served to end users (images, videos, downloads)?</strong>
→ Blob Storage</p>
<p><strong>Is it a message that needs to trigger something downstream?</strong>
→ Queue Storage</p>
<p><strong>Is it small structured data - a config value, a watermark, a metadata record?</strong>
→ Table Storage</p>
<p><strong>Is it a file share that a VM or legacy app needs to mount over SMB?</strong>
→ File Storage</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-key-lessons">The Key Lessons<a href="https://www.recodehive.com/blog/azure-storage-options#the-key-lessons" class="hash-link" aria-label="Direct link to The Key Lessons" title="Direct link to The Key Lessons" translate="no">​</a></h2>
<p><strong>1. Azure storage is four different things.</strong> Each one has a specific job. Using the wrong one is a surprisingly easy mistake to make on day one and a frustrating one to debug.</p>
<p><strong>2. ADLS Gen2 is Blob Storage with one upgrade that changes everything.</strong> The hierarchical namespace turns flat object storage into a real file system. That single feature is why every serious Azure analytics service is built on top of it.</p>
<p><strong>3. ADLS Gen2 is the Bronze/Silver/Gold spine of Medallion Architecture.</strong> The layers aren't abstract concepts, they're real directories in a container, with Spark jobs and ADF pipelines connecting them.</p>
<p><strong>4. Queue and Table Storage are the glue.</strong> They're not glamorous, but production pipelines depend on them for event triggering and state management.</p>
<p><strong>5. OneLake is ADLS Gen2.</strong> When you use Microsoft Fabric, you're using ADLS Gen2 underneath. Understanding the storage layer means you understand what every Azure data platform is actually built on.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="references--further-reading">References &amp; Further Reading<a href="https://www.recodehive.com/blog/azure-storage-options#references--further-reading" class="hash-link" aria-label="Direct link to References &amp; Further Reading" title="Direct link to References &amp; Further Reading" translate="no">​</a></h2>
<ul>
<li class=""><a href="https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction" target="_blank" rel="noopener noreferrer" class="">Microsoft Docs — Introduction to Azure Data Lake Storage Gen2</a></li>
<li class=""><a href="https://learn.microsoft.com/en-us/azure/storage/common/storage-introduction" target="_blank" rel="noopener noreferrer" class="">Microsoft Docs — Azure Storage Overview</a></li>
<li class=""><a href="https://learn.microsoft.com/en-us/azure/storage/common/storage-account-overview" target="_blank" rel="noopener noreferrer" class="">Microsoft Docs — Storage Account Overview</a></li>
<li class=""><a href="https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-abfs-driver" target="_blank" rel="noopener noreferrer" class="">Microsoft Docs — ABFS Driver for ADLS Gen2</a></li>
<li class=""><a href="https://www.recodehive.com/blog/medallion-architecture" target="_blank" rel="noopener noreferrer" class="">RecodeHive — Medallion Architecture Explained</a></li>
<li class=""><a href="https://www.recodehive.com/blog/microsoft-fabric-one-platform-one-lake-every-data-workload" target="_blank" rel="noopener noreferrer" class="">RecodeHive — Microsoft Fabric: One Platform, One Lake</a></li>
<li class=""><a href="https://www.recodehive.com/blog/lakehouse-vs-data-warehouse" target="_blank" rel="noopener noreferrer" class="">RecodeHive — Lakehouse vs Data Warehouse</a></li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="about-the-author">About the Author<a href="https://www.recodehive.com/blog/azure-storage-options#about-the-author" class="hash-link" aria-label="Direct link to About the Author" title="Direct link to About the Author" translate="no">​</a></h2>
<p>I'm <strong>Aditya Singh Rathore</strong>, a Data Engineer passionate about building modern, scalable data platforms on Azure. I write about data engineering, cloud architecture, and real-world pipelines on <a href="https://www.recodehive.com/" target="_blank" rel="noopener noreferrer" class="">RecodeHive</a> — breaking down complex concepts into things you can actually use.</p>
<p>🔗 <a href="https://www.linkedin.com/in/aditya-singh-rathore0017/" target="_blank" rel="noopener noreferrer" class="">LinkedIn</a> | <a href="https://github.com/Adez017" target="_blank" rel="noopener noreferrer" class="">GitHub</a></p>
<p>📩 Building something on Azure and stuck on storage decisions? Drop your question in the comments.</p>
<div></div>]]></content:encoded>
            <author>rathoreadityasingh30@gmail.com (Aditya Singh Rathore)</author>
            <category>azure-storage</category>
            <category>blob-storage</category>
            <category>adls-gen2</category>
            <category>azure-data-lake</category>
            <category>queue-storage</category>
            <category>table-storage</category>
            <category>file-storage</category>
            <category>data-engineering</category>
            <category>azure</category>
            <category>big-data</category>
            <category>medallion-architecture</category>
        </item>
        <item>
            <title><![CDATA[Azure Synapse Analytics: When to Use It (And When to Choose Fabric Instead)]]></title>
            <link>https://www.recodehive.com/blog/azure-synapse-analytics</link>
            <guid>https://www.recodehive.com/blog/azure-synapse-analytics</guid>
            <pubDate>Wed, 06 May 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Azure Synapse Analytics is one of the most powerful tools in the Azure data stack. But in 2026, with Microsoft Fabric growing fast, the question isn't just "what is Synapse?" — it's "when should you still use it, and when should you move to Fabric?" Here's the honest answer.]]></description>
            <content:encoded><![CDATA[<p>When I first started working seriously with Azure, Synapse was the answer to almost every data question.</p>
<p>Need a SQL warehouse? Synapse. Need Spark for big data? Synapse. Need pipelines to move data? Synapse. Need to query files sitting in ADLS Gen2 without loading them anywhere? Synapse.</p>
<p>It was genuinely impressive, one workspace that brought together SQL, Spark, pipelines, and storage into a single studio. I built three production pipelines on it and it worked well.</p>
<p>Then Microsoft Fabric arrived.</p>
<p>And now the question I get asked most often is: <em>"Should I still use Synapse, or should I move to Fabric?"</em></p>
<p>The honest answer is: <strong>it depends on where you are in your Azure journey.</strong> This blog gives you the full picture, what Synapse actually is, when it's the right call, when Fabric is the better choice, and how to think about the transition if you're already on Synapse.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-azure-synapse-analytics-actually-is">What Azure Synapse Analytics Actually Is<a href="https://www.recodehive.com/blog/azure-synapse-analytics#what-azure-synapse-analytics-actually-is" class="hash-link" aria-label="Direct link to What Azure Synapse Analytics Actually Is" title="Direct link to What Azure Synapse Analytics Actually Is" translate="no">​</a></h2>
<p>Azure Synapse Analytics started as the next step beyond Azure SQL Data Warehouse, but over time it evolved into a much broader analytics platform rather than remaining just a cloud data warehouse solution.</p>
<p>What changed significantly was the addition of multiple processing engines and integrated tooling within a single workspace. Instead of working only with SQL-based warehousing, teams could now combine:</p>
<ul>
<li class="">large-scale Spark processing</li>
<li class="">SQL analytics</li>
<li class="">real-time exploration capabilities</li>
<li class="">orchestration pipelines</li>
<li class="">integrated data lake access</li>
</ul>
<p>This shift made Synapse more of a unified analytics ecosystem on Azure, where data engineering, big data processing, and reporting workloads could coexist within the same platform experience.</p>
<p>One of the biggest differences compared to the earlier SQL Data Warehouse model is that Synapse tries to reduce the fragmentation between storage, transformation, orchestration, and analytics services that previously had to be managed separately.</p>
<p>In plain terms: it's a unified analytics platform that brings together four things that used to require four separate Azure services:</p>
<ul>
<li class=""><strong>SQL analytics</strong> - for querying structured data at scale</li>
<li class=""><strong>Apache Spark</strong> - for big data processing, ML, and complex transformations</li>
<li class=""><strong>Data integration (Synapse Pipelines)</strong> - for moving and transforming data across systems</li>
<li class=""><strong>A unified workspace (Synapse Studio)</strong> - where all of the above live together</li>
</ul>
<p><img decoding="async" loading="lazy" alt="Azure Synapse Analytics architecture showing four core components: Dedicated SQL Pool, Serverless SQL Pool, Apache Spark Pool, and Synapse Pipelines — all connected to ADLS Gen2 storage and accessible via Synapse Studio" src="https://www.recodehive.com/assets/images/synapse-architecture-767a1bdcc66e87b317f519d5aae66213.png" width="1672" height="941" class="img_ev3q"></p>
<p>The key architectural principle underneath all of this is the <strong>separation of compute and storage</strong>. This decoupling allows organizations to scale their processing power independently of their data volume, compute resources can be ramped up to handle peak query loads and then scaled down or even paused during periods of inactivity, all without affecting the underlying data stored in ADLS Gen2.</p>
<p>That's a big deal in practice. You pay for compute only when you use it.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-four-core-components---what-each-one-does">The Four Core Components - What Each One Does<a href="https://www.recodehive.com/blog/azure-synapse-analytics#the-four-core-components---what-each-one-does" class="hash-link" aria-label="Direct link to The Four Core Components - What Each One Does" title="Direct link to The Four Core Components - What Each One Does" translate="no">​</a></h2>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="1-dedicated-sql-pools-high-performance-data-warehousing">1. Dedicated SQL Pools: High-Performance Data Warehousing<a href="https://www.recodehive.com/blog/azure-synapse-analytics#1-dedicated-sql-pools-high-performance-data-warehousing" class="hash-link" aria-label="Direct link to 1. Dedicated SQL Pools: High-Performance Data Warehousing" title="Direct link to 1. Dedicated SQL Pools: High-Performance Data Warehousing" translate="no">​</a></h3>
<p>Dedicated SQL Pools are Synapse's data warehousing engine. You provision a fixed amount of compute capacity measured in <strong>Data Warehouse Units (DWUs)</strong>, and in return you get consistent, predictable query performance.</p>
<p>Dedicated SQL pools provision reserved compute capacity measured in Data Warehouse Units. They deliver consistent performance for production workloads, scheduled reports, and dashboards that need predictable response times.</p>
<p>This is the right choice when:</p>
<ul>
<li class="">You have large, structured datasets that are queried repeatedly by BI tools</li>
<li class="">You need consistent sub-second query performance for dashboards</li>
<li class="">Your team works primarily in T-SQL</li>
<li class="">You're migrating from an on-premises SQL Server or Oracle data warehouse</li>
</ul>
<p>The trade-off: you pay for the provisioned DWUs whether you're running queries or not. It's expensive to leave a Dedicated SQL Pool running 24/7 for workloads that only query it during business hours.</p>
<p><strong>The practical fix:</strong> pause your Dedicated SQL Pool outside business hours. Synapse lets you do this programmatically via Azure Automation or ADF pipelines — you only pay for compute when it's actually running.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="2-serverless-sql-pool-query-without-loading">2. Serverless SQL Pool: Query Without Loading<a href="https://www.recodehive.com/blog/azure-synapse-analytics#2-serverless-sql-pool-query-without-loading" class="hash-link" aria-label="Direct link to 2. Serverless SQL Pool: Query Without Loading" title="Direct link to 2. Serverless SQL Pool: Query Without Loading" translate="no">​</a></h3>
<p>Serverless SQL Pool is probably one of the most practical and underrated capabilities inside Azure Synapse.</p>
<p>What makes it interesting is how quickly you can start querying data directly from your data lake without provisioning dedicated infrastructure upfront. Instead of maintaining a constantly running cluster, the engine dynamically allocates compute only when a query is executed.</p>
<p>Under the hood, queries are distributed across multiple compute resources and processed in parallel, which makes it surprisingly efficient for exploratory analysis and lightweight analytical workloads.</p>
<p>The pricing model is also very different from traditional warehouses. Since billing is based on the amount of data scanned per query, it works particularly well for:</p>
<ul>
<li class="">ad-hoc analysis</li>
<li class="">one-time investigations</li>
<li class="">querying historical files</li>
<li class="">lightweight reporting workloads</li>
<li class="">infrequently accessed datasets</li>
</ul>
<p>The first time I used it, the biggest surprise was how quickly I could run SQL directly on files sitting in ADLS without setting up ingestion pipelines or persistent compute.</p>
<p>In practice: you can write a SQL query directly against Parquet, CSV, or Delta files sitting in ADLS Gen2 <strong>without loading them into any database first</strong>.</p>
<div class="language-sql codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-sql codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token comment" style="color:#999988;font-style:italic">-- Query a Parquet file in ADLS Gen2 directly — no loading required</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">SELECT</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    region</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token function" style="color:#d73a49">SUM</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">amount</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">AS</span><span class="token plain"> total_revenue</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token function" style="color:#d73a49">COUNT</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">order_id</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">AS</span><span class="token plain"> total_orders</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">FROM</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">OPENROWSET</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">BULK</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'https://mylake.dfs.core.windows.net/silver/sales/2024/**'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">        FORMAT </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'PARQUET'</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">AS</span><span class="token plain"> sales_data</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">GROUP</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">BY</span><span class="token plain"> region</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">ORDER</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">BY</span><span class="token plain"> total_revenue </span><span class="token keyword" style="color:#00009f">DESC</span><span class="token punctuation" style="color:#393A34">;</span><br></div></code></pre></div></div>
<p>You pay for the bytes scanned by that query. Nothing more.</p>
<p>This is the right choice when:</p>
<ul>
<li class="">You need to explore raw data in ADLS Gen2 before deciding how to model it</li>
<li class="">You have analysts who know SQL but don't want to write Spark code</li>
<li class="">You're running occasional ad-hoc queries that don't justify provisioning a dedicated warehouse</li>
<li class="">You want to build a <strong>logical data warehouse</strong> on top of your data lake without moving data</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="3-apache-spark-pools-big-data-and-ml-workloads">3. Apache Spark Pools: Big Data and ML Workloads<a href="https://www.recodehive.com/blog/azure-synapse-analytics#3-apache-spark-pools-big-data-and-ml-workloads" class="hash-link" aria-label="Direct link to 3. Apache Spark Pools: Big Data and ML Workloads" title="Direct link to 3. Apache Spark Pools: Big Data and ML Workloads" translate="no">​</a></h3>
<p>Azure Synapse Analytics includes deeply integrated Apache Spark capabilities, allowing teams to work with large-scale data processing directly within the Synapse workspace instead of managing separate big data platforms.</p>
<p>Spark Pools provide a managed Spark environment where engineers and data scientists can build ETL pipelines, prepare large datasets, process semi-structured or unstructured data, and develop machine learning workflows using familiar notebook-based development.</p>
<p>One thing I found particularly useful is that infrastructure management is mostly abstracted away. You can write notebooks using Python, Scala, SQL, or R while Synapse handles much of the operational overhead like cluster provisioning, scaling, and session management behind the scenes.</p>
<p>This makes Spark Pools especially practical for workloads that go beyond traditional SQL transformations and require distributed computation at scale.</p>
<p>This is the right choice when:</p>
<ul>
<li class="">Your transformations are too complex for SQL alone</li>
<li class="">You're building ML pipelines or training models on large datasets</li>
<li class="">You need to process semi-structured data (JSON, nested arrays) at scale</li>
<li class="">Your data engineering team is comfortable in PySpark or Scala</li>
</ul>
<p>The key advantage over standalone Spark clusters: Spark Pools share the same workspace as your SQL Pools and Pipelines. A Spark notebook can write a Delta table that a SQL analyst can immediately query without any data movement or cross-service configuration.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="4-synapse-pipelines-data-integration-and-orchestration">4. Synapse Pipelines: Data Integration and Orchestration<a href="https://www.recodehive.com/blog/azure-synapse-analytics#4-synapse-pipelines-data-integration-and-orchestration" class="hash-link" aria-label="Direct link to 4. Synapse Pipelines: Data Integration and Orchestration" title="Direct link to 4. Synapse Pipelines: Data Integration and Orchestration" translate="no">​</a></h3>
<p>Synapse Pipelines is the data integration layer. It uses the same engine as Azure Data Factory, which means teams already using ADF will recognize the interface and functionality. Pipelines handle the movement and transformation of data across systems connecting to sources, extracting data, applying transformations, and loading results into destinations.</p>
<p>If you've used Azure Data Factory, Synapse Pipelines will feel immediately familiar. It's the same visual, activity-based orchestration tool with 95+ connectors to external systems, built directly into the Synapse workspace.</p>
<p>The advantage over standalone ADF: your pipelines live in the same workspace as your SQL and Spark workloads. You can trigger a Spark notebook, run a SQL script, and copy data to ADLS Gen2, all within a single pipeline, without leaving Synapse Studio.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-synapse-studio-actually-looks-like">What Synapse Studio Actually Looks Like<a href="https://www.recodehive.com/blog/azure-synapse-analytics#what-synapse-studio-actually-looks-like" class="hash-link" aria-label="Direct link to What Synapse Studio Actually Looks Like" title="Direct link to What Synapse Studio Actually Looks Like" translate="no">​</a></h2>
<p>Synapse Studio is the unified web-based interface that ties everything together. From one interface, teams can write and execute SQL queries against data warehouse tables, build and run Apache Spark notebooks, design data pipelines using visual drag-and-drop tools, monitor jobs, manage resources, and configure security settings. Data engineers building pipelines and analysts writing reports work in the same environment with access to the same underlying data.</p>
<p>In practice, this means less context-switching. When I was building pipelines on Synapse, the biggest quality-of-life win was being able to debug a Spark notebook, run a SQL query against its output, and check the pipeline that triggered it, all in the same browser tab.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="real-world-use-cases---when-synapse-is-the-right-call">Real-World Use Cases - When Synapse Is the Right Call<a href="https://www.recodehive.com/blog/azure-synapse-analytics#real-world-use-cases---when-synapse-is-the-right-call" class="hash-link" aria-label="Direct link to Real-World Use Cases - When Synapse Is the Right Call" title="Direct link to Real-World Use Cases - When Synapse Is the Right Call" translate="no">​</a></h2>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="use-case-1-enterprise-data-warehouse-migration">Use Case 1: Enterprise Data Warehouse Migration<a href="https://www.recodehive.com/blog/azure-synapse-analytics#use-case-1-enterprise-data-warehouse-migration" class="hash-link" aria-label="Direct link to Use Case 1: Enterprise Data Warehouse Migration" title="Direct link to Use Case 1: Enterprise Data Warehouse Migration" translate="no">​</a></h3>
<p>Organizations moving from on-premises data warehouses like SQL Server or Oracle to Azure Synapse benefit from enhanced scalability, cost savings, and better performance.</p>
<p>If your team is deeply invested in T-SQL, has existing stored procedures and reporting logic, and is migrating from SQL Server or Azure SQL DW — Synapse's Dedicated SQL Pool is the most natural landing spot. The syntax is familiar, the tooling is mature, and the migration path is well-documented.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="use-case-2-ad-hoc-exploration-on-a-data-lake">Use Case 2: Ad-Hoc Exploration on a Data Lake<a href="https://www.recodehive.com/blog/azure-synapse-analytics#use-case-2-ad-hoc-exploration-on-a-data-lake" class="hash-link" aria-label="Direct link to Use Case 2: Ad-Hoc Exploration on a Data Lake" title="Direct link to Use Case 2: Ad-Hoc Exploration on a Data Lake" translate="no">​</a></h3>
<p>You've landed months of raw data in ADLS Gen2 and need to understand what's in it before building a formal pipeline. Serverless SQL Pool lets analysts write SQL against those files immediately without waiting for a data engineer to model the data first.</p>
<p>This is genuinely one of Synapse's strongest differentiators. No other Azure service lets SQL analysts query raw Parquet files on a data lake this directly, this cheaply.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="use-case-3-mixed-sql--spark-workloads">Use Case 3: Mixed SQL + Spark Workloads<a href="https://www.recodehive.com/blog/azure-synapse-analytics#use-case-3-mixed-sql--spark-workloads" class="hash-link" aria-label="Direct link to Use Case 3: Mixed SQL + Spark Workloads" title="Direct link to Use Case 3: Mixed SQL + Spark Workloads" translate="no">​</a></h3>
<p>Your team has SQL analysts querying a data warehouse and data engineers running Spark transformation jobs. In most stacks, these two groups work in separate tools with separate data copies.</p>
<p>In Synapse, Spark can write a Delta table that the SQL pool reads, and SQL results can feed back into Spark notebooks without data movement between services. Both groups work against the same underlying data in ADLS Gen2.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="use-case-4-regulated-industries-requiring-network-isolation">Use Case 4: Regulated Industries Requiring Network Isolation<a href="https://www.recodehive.com/blog/azure-synapse-analytics#use-case-4-regulated-industries-requiring-network-isolation" class="hash-link" aria-label="Direct link to Use Case 4: Regulated Industries Requiring Network Isolation" title="Direct link to Use Case 4: Regulated Industries Requiring Network Isolation" translate="no">​</a></h3>
<p>Synapse has mature support for managed virtual networks and private endpoints. For teams in finance, healthcare, or government where strict data residency and network isolation are non-negotiable requirements, Synapse's mature networking controls are a significant advantage over Fabric, whose networking story is still evolving.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="synapse-vs-fabric-the-honest-comparison">Synapse vs Fabric: The Honest Comparison<a href="https://www.recodehive.com/blog/azure-synapse-analytics#synapse-vs-fabric-the-honest-comparison" class="hash-link" aria-label="Direct link to Synapse vs Fabric: The Honest Comparison" title="Direct link to Synapse vs Fabric: The Honest Comparison" translate="no">​</a></h2>
<p>Azure Synapse Analytics is a platform-as-a-service (PaaS) solution that provides modular components giving fine-grained control over data workflows. Microsoft Fabric represents a software-as-a-service (SaaS) approach bringing everything together into a single unified platform with shared governance, compute, and storage through OneLake.</p>
<table><thead><tr><th>Dimension</th><th>Azure Synapse</th><th>Microsoft Fabric</th></tr></thead><tbody><tr><td><strong>Deployment model</strong></td><td>PaaS - you manage compute resources</td><td>SaaS - fully managed</td></tr><tr><td><strong>Storage</strong></td><td>ADLS Gen2 (you manage)</td><td>OneLake (unified, managed for you)</td></tr><tr><td><strong>SQL engine</strong></td><td>Dedicated + Serverless SQL Pools</td><td>Fabric Warehouse + SQL analytics endpoint</td></tr><tr><td><strong>Spark</strong></td><td>Apache Spark Pools</td><td>Fabric Spark (same engine, newer experience)</td></tr><tr><td><strong>Pipelines</strong></td><td>Synapse Pipelines (ADF engine)</td><td>Fabric Data Factory (next-gen ADF)</td></tr><tr><td><strong>Real-time</strong></td><td>Data Explorer (partially retired)</td><td>Eventstreams + Eventhouse (KQL)</td></tr><tr><td><strong>Network isolation</strong></td><td>Mature - managed VNet, private endpoints</td><td>Still evolving</td></tr><tr><td><strong>T-SQL support</strong></td><td>Full</td><td>Some gaps (OPENROWSET and others)</td></tr><tr><td><strong>AI / Copilot</strong></td><td>Limited</td><td>Built-in Copilot across all workloads</td></tr><tr><td><strong>Direction</strong></td><td>Maintenance mode</td><td>Active investment - new features land here first</td></tr><tr><td><strong>Best for</strong></td><td>Existing investments, regulated industries, SQL-heavy teams</td><td>Greenfield projects, unified analytics, AI workloads</td></tr></tbody></table>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="should-you-migrate-from-synapse-to-fabric">Should You Migrate from Synapse to Fabric?<a href="https://www.recodehive.com/blog/azure-synapse-analytics#should-you-migrate-from-synapse-to-fabric" class="hash-link" aria-label="Direct link to Should You Migrate from Synapse to Fabric?" title="Direct link to Should You Migrate from Synapse to Fabric?" translate="no">​</a></h2>
<p>If you're already on Synapse, here's the pragmatic framework:</p>
<p><strong>Migrate these workloads to Fabric now:</strong></p>
<ul>
<li class="">Spark-based data engineering notebooks and jobs</li>
<li class="">Synapse Pipelines (the migration assistant handles most of this automatically)</li>
<li class="">Real-time analytics workloads (Fabric's Eventhouse is better than Data Explorer)</li>
<li class="">Power BI-connected workloads (DirectLake mode is a significant upgrade)</li>
</ul>
<p><strong>Keep these on Synapse for now:</strong></p>
<ul>
<li class="">Workloads that depend heavily on Dedicated SQL Pool features</li>
<li class="">Pipelines that require complex network isolation or private endpoints</li>
<li class="">Anything using features that don't have a Fabric equivalent yet (OPENROWSET, Synapse Link for some sources)</li>
</ul>
<p>A phased approach works best: migrate greenfield workloads to Fabric immediately, then build a roadmap for existing Synapse workloads as Fabric's feature gaps close.</p>
<p>The good news: the migration assistant automatically migrates core Spark artifacts from Azure Synapse Analytics into Fabric Data Engineering, bringing over Spark pools, notebooks, and Spark job definitions with no data moved during the process.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-key-lessons">The Key Lessons<a href="https://www.recodehive.com/blog/azure-synapse-analytics#the-key-lessons" class="hash-link" aria-label="Direct link to The Key Lessons" title="Direct link to The Key Lessons" translate="no">​</a></h2>
<p><strong>1. Synapse is not dead but it's not the future either.</strong> It's a fully supported, production-ready platform that will be around for years. But Microsoft's innovation is going into Fabric, not Synapse.</p>
<p><strong>2. Serverless SQL Pool is genuinely underrated.</strong> The ability to query raw files in ADLS Gen2 with SQL, paying only for bytes scanned, is one of the most cost-efficient features in the entire Azure data stack. Even if you move to Fabric, this pattern is worth understanding.</p>
<p><strong>3. For greenfield projects in 2026, start with Fabric.</strong> The OneLake architecture, the unified experience, and the Copilot integration make it the better starting point for anything new.</p>
<p><strong>4. For existing Synapse investments, migrate in phases.</strong> Don't rush a full migration. Move Spark workloads and pipelines first. Evaluate Dedicated SQL Pool workloads carefully before touching them.</p>
<p><strong>5. The separation of compute and storage matters.</strong> Whether you're on Synapse or Fabric, the underlying principle is the same, your data lives in ADLS Gen2 / OneLake, and your compute scales independently. Understanding this makes both platforms easier to reason about.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="references--further-reading">References &amp; Further Reading<a href="https://www.recodehive.com/blog/azure-synapse-analytics#references--further-reading" class="hash-link" aria-label="Direct link to References &amp; Further Reading" title="Direct link to References &amp; Further Reading" translate="no">​</a></h2>
<ul>
<li class=""><a href="https://learn.microsoft.com/en-us/azure/synapse-analytics/overview-what-is" target="_blank" rel="noopener noreferrer" class="">Microsoft Docs - Azure Synapse Analytics Overview</a></li>
<li class=""><a href="https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/on-demand-workspace-overview" target="_blank" rel="noopener noreferrer" class="">Microsoft Docs - Serverless SQL Pool</a></li>
<li class=""><a href="https://community.fabric.microsoft.com/t5/Fabric-Updates-Blogs/From-Azure-Synapse-and-Azure-Data-Factory-to-Microsoft-Fabric/ba-p/5172227" target="_blank" rel="noopener noreferrer" class="">Microsoft Fabric Blog - Migrating from Synapse to Fabric</a></li>
<li class=""><a href="https://learn.microsoft.com/en-us/fabric/data-engineering/migrate-synapse-data-pipelines" target="_blank" rel="noopener noreferrer" class="">Microsoft Docs - Migrate Synapse Pipelines to Fabric</a></li>
<li class=""><a href="https://www.recodehive.com/blog/microsoft-fabric-explained" target="_blank" rel="noopener noreferrer" class="">RecodeHive - Microsoft Fabric: One Platform, One Lake</a></li>
<li class=""><a href="https://www.recodehive.com/blog/azure-storage-options" target="_blank" rel="noopener noreferrer" class="">RecodeHive - Azure Storage &amp; ADLS Gen2</a></li>
<li class=""><a href="https://www.recodehive.com/blog/lakehouse-vs-warehouse" target="_blank" rel="noopener noreferrer" class="">RecodeHive - Lakehouse vs Data Warehouse</a></li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="about-the-author">About the Author<a href="https://www.recodehive.com/blog/azure-synapse-analytics#about-the-author" class="hash-link" aria-label="Direct link to About the Author" title="Direct link to About the Author" translate="no">​</a></h2>
<p>I'm <strong>Aditya Singh Rathore</strong>, a Data Engineer passionate about building modern, scalable data platforms on Azure. I write about data engineering, cloud architecture, and real-world pipelines on <a href="https://www.recodehive.com/" target="_blank" rel="noopener noreferrer" class="">RecodeHive</a> breaking down complex concepts into things you can actually use.</p>
<p>🔗 <a href="https://www.linkedin.com/in/aditya-singh-rathore0017/" target="_blank" rel="noopener noreferrer" class="">LinkedIn</a> | <a href="https://github.com/Adez017" target="_blank" rel="noopener noreferrer" class="">GitHub</a></p>
<p>📩 Still on Synapse and thinking about Fabric? Drop your questions in the comments, happy to help you think through the migration.</p>
<div></div>]]></content:encoded>
            <author>rathoreadityasingh30@gmail.com (Aditya Singh Rathore)</author>
            <category>azure-synapse-analytics</category>
            <category>data-engineering</category>
            <category>sql-pools</category>
            <category>apache-spark</category>
            <category>microsoft-fabric</category>
            <category>data-warehouse</category>
            <category>adls-gen2</category>
            <category>azure</category>
            <category>big-data</category>
            <category>etl</category>
        </item>
        <item>
            <title><![CDATA[Why We Rolled Back Our Kafka Pipeline to Batch After 6 Months]]></title>
            <link>https://www.recodehive.com/blog/batch-vs-stream-processing</link>
            <guid>https://www.recodehive.com/blog/batch-vs-stream-processing</guid>
            <pubDate>Wed, 06 May 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Everyone talks about the benefits of streaming pipelines — real-time insights, millisecond latency, live dashboards. Nobody talks about what it actually costs you. I rebuilt a working batch pipeline as a streaming system. Here's what I learned the hard way.]]></description>
            <content:encoded><![CDATA[<p>Everyone in data engineering is obsessed with real time.</p>
<p>Kafka. Flink. Event-driven architectures. Millisecond latency. Live dashboards. It's the direction every conference talk points, every job description asks for, every architecture diagram proudly features.</p>
<p>And I bought into it completely.</p>
<p>About a year into my data engineering career, our product team came to us with a request: customers wanted to see their order status update in real time. Our existing batch pipeline ran at 2am every night, customers were calling support asking where their orders were.</p>
<p>Reasonable ask. So we rebuilt the pipeline as a streaming system.</p>
<p>Six months later, I had learned more about the real cost of streaming than any blog post or conference talk had ever prepared me for.</p>
<p>This is that story — and the honest breakdown I wish someone had given me before I started.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-we-had-before-and-why-it-worked">What We Had Before (And Why It Worked)<a href="https://www.recodehive.com/blog/batch-vs-stream-processing#what-we-had-before-and-why-it-worked" class="hash-link" aria-label="Direct link to What We Had Before (And Why It Worked)" title="Direct link to What We Had Before (And Why It Worked)" translate="no">​</a></h2>
<p>Our original order pipeline was batch. It ran every night at 2am via Azure Data Factory, pulled 24 hours of orders from our SQL database, ran a Spark transformation job, and wrote clean Delta tables to ADLS Gen2.</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">Every night at 2am:</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    ↓</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">ADF Pipeline triggers</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    ↓</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">Pull all orders from the last 24 hours</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    ↓</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">Spark: clean → deduplicate → join product catalog</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    ↓</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">Write to Silver layer (Delta table on ADLS Gen2)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    ↓</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">Aggregate into Gold layer</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    ↓</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">Power BI refreshes — customers see updated status</span><br></div></code></pre></div></div>
<p>It ran in 45 minutes. Our Spark cluster spun up, did its job, and shut down. We paid for 45 minutes of compute per day. The pipeline was simple, debuggable, and recoverable, if something broke, we fixed it and replayed from Bronze.</p>
<p>The only problem: customers saw data that was 6 to 30 hours old depending on when they ordered.</p>
<p>For most use cases, that's fine. For order status, it wasn't.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="hidden-cost-1---infrastructure-that-never-sleeps">Hidden Cost #1 - Infrastructure That Never Sleeps<a href="https://www.recodehive.com/blog/batch-vs-stream-processing#hidden-cost-1---infrastructure-that-never-sleeps" class="hash-link" aria-label="Direct link to Hidden Cost #1 - Infrastructure That Never Sleeps" title="Direct link to Hidden Cost #1 - Infrastructure That Never Sleeps" translate="no">​</a></h2>
<p>The first thing that surprised me about our streaming pipeline was the infrastructure bill.</p>
<p>Our batch Spark cluster ran 45 minutes a day. Our Kafka + Flink setup runs <strong>every minute of every day</strong> - 24 hours, 7 days a week, whether there are 10 events per second or 10,000.</p>
<p>Streaming infrastructure requires 24/7 uptime. You can't spin it down overnight to save money. You can't schedule it during off-peak hours. The pipeline is always on, always consuming resources, always incurring cost.</p>
<p>For our team, the monthly compute cost for the streaming pipeline was roughly <strong>4x</strong> what the equivalent batch job cost and that was before accounting for the additional engineering time to maintain it.</p>
<blockquote>
<p><strong>The question to ask before going streaming:</strong> Is the business value of real-time data worth 4x the infrastructure cost? Sometimes the answer is yes. Often it isn't.</p>
</blockquote>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="hidden-cost-2---late-arriving-data-will-break-your-logic">Hidden Cost #2 - Late-Arriving Data Will Break Your Logic<a href="https://www.recodehive.com/blog/batch-vs-stream-processing#hidden-cost-2---late-arriving-data-will-break-your-logic" class="hash-link" aria-label="Direct link to Hidden Cost #2 - Late-Arriving Data Will Break Your Logic" title="Direct link to Hidden Cost #2 - Late-Arriving Data Will Break Your Logic" translate="no">​</a></h2>
<p>In a batch pipeline, late data is not a problem. If an event arrives 3 hours late, it's in the next batch. The pipeline processes it, life goes on.</p>
<p>In a streaming pipeline, late-arriving data is one of the hardest problems in distributed systems.</p>
<p>Events can arrive out of order due to network delays, retries, or clock skew between services. Your Flink job is processing event #1,000 when event #987 suddenly arrives 45 seconds late. What do you do?</p>
<p>The answer involves <strong>watermarking</strong>, telling your stream processor "wait X seconds after the event time before closing a window, to account for late arrivals." But choosing the right watermark is a balance:</p>
<ul>
<li class="">Too short: you miss late-arriving events, your aggregations are wrong</li>
<li class="">Too long: you hold state in memory longer, increasing latency and memory pressure</li>
</ul>
<p>We got this wrong twice before landing on a configuration that worked. Both times, our order counts were silently off by 1-3%, small enough to look like noise, large enough to cause problems in financial reconciliation.</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">Late data problem illustrated:</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">Event time:  10:00  10:01  10:02  10:03  10:04</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">Arrived at:  10:00  10:01  10:04  10:03  10:05</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                            ↑</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                    event #3 arrived 2 minutes late</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                    — already missed the 10:02 window</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">                    — your aggregate is wrong</span><br></div></code></pre></div></div>
<p>In batch, this doesn't exist as a problem. In streaming, it's a constant engineering challenge.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="hidden-cost-3---exactly-once-is-harder-than-it-sounds">Hidden Cost #3 - Exactly-Once Is Harder Than It Sounds<a href="https://www.recodehive.com/blog/batch-vs-stream-processing#hidden-cost-3---exactly-once-is-harder-than-it-sounds" class="hash-link" aria-label="Direct link to Hidden Cost #3 - Exactly-Once Is Harder Than It Sounds" title="Direct link to Hidden Cost #3 - Exactly-Once Is Harder Than It Sounds" translate="no">​</a></h2>
<p>Handling failures in batch pipelines is usually predictable.<br>
<!-- -->If a batch job fails, you typically resolve the issue and rerun the pipeline from the beginning. Since the processing happens on bounded data, recovery is relatively straightforward.</p>
<p>Streaming systems work very differently.</p>
<p>In platforms like Kafka and Flink, data is continuously flowing through the system. If a streaming job crashes midway through processing, recovery becomes much more complex than simply restarting the job.</p>
<p>For example, after recovery:</p>
<ul>
<li class="">Should previously processed events be replayed?</li>
<li class="">Could some records get skipped unintentionally?</li>
<li class="">Is there a possibility that certain events are processed more than once?</li>
</ul>
<p>This challenge is commonly addressed through <strong>exactly-once processing guarantees</strong>, where the goal is to ensure that every event affects the system exactly one time even during failures and restarts.</p>
<p>Achieving reliable exactly-once behavior usually depends on several components working together correctly:</p>
<ul>
<li class="">Proper Kafka offset management</li>
<li class="">Reliable Flink checkpointing and state recovery</li>
<li class="">Idempotent writes to downstream systems</li>
<li class="">Consistent state synchronization during failover scenarios</li>
</ul>
<p>In practice, recovery bugs in streaming systems can have real operational impact. A single restart issue can lead to duplicate event processing, inconsistent downstream data, repeated customer notifications, or inaccurate analytics until the state is corrected.</p>
<p>Unlike batch systems, where failures often leave datasets untouched until rerun, streaming failures can leave systems in partially updated states that are significantly harder to debug and recover from.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="hidden-cost-4---testing-is-a-different-discipline">Hidden Cost #4 - Testing Is a Different Discipline<a href="https://www.recodehive.com/blog/batch-vs-stream-processing#hidden-cost-4---testing-is-a-different-discipline" class="hash-link" aria-label="Direct link to Hidden Cost #4 - Testing Is a Different Discipline" title="Direct link to Hidden Cost #4 - Testing Is a Different Discipline" translate="no">​</a></h2>
<p>Testing a batch pipeline is relatively straightforward. You have a dataset, you run the transformation, you check the output. Deterministic, reproducible, easy to validate.</p>
<p>Testing a streaming pipeline requires simulating event streams with realistic timing, ordering, and volume. You need to test:</p>
<ul>
<li class="">What happens when events arrive out of order?</li>
<li class="">What happens when a consumer crashes and restarts?</li>
<li class="">What happens when Kafka lag builds up during a traffic spike?</li>
<li class="">What happens when an upstream service sends a malformed event?</li>
</ul>
<p>We discovered most of our edge cases in production, not in testing. Not because we were careless, but because accurately simulating a live event stream in a test environment is genuinely difficult.</p>
<p>Our batch pipeline had a test suite that ran in 8 minutes. Our streaming pipeline's test suite took 40 minutes and still missed three production bugs in the first month.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="hidden-cost-5---your-team-needs-streaming-specific-skills">Hidden Cost #5 - Your Team Needs Streaming-Specific Skills<a href="https://www.recodehive.com/blog/batch-vs-stream-processing#hidden-cost-5---your-team-needs-streaming-specific-skills" class="hash-link" aria-label="Direct link to Hidden Cost #5 - Your Team Needs Streaming-Specific Skills" title="Direct link to Hidden Cost #5 - Your Team Needs Streaming-Specific Skills" translate="no">​</a></h2>
<p>This one is easy to underestimate.</p>
<p>Batch data engineering skills - Spark, SQL, dbt, ADF are well-understood, well-documented, and widely held. If someone on your team leaves, finding a replacement with those skills is manageable.</p>
<p>Streaming-specific skills Kafka internals, Flink state management, watermarking strategies, consumer group management, exactly-once configuration are genuinely harder to find and take longer to develop.</p>
<p>When we hit our first major Flink issue (a state backend misconfiguration causing memory pressure under load), our team spent three days debugging something that an experienced Flink engineer would have spotted in 20 minutes. We didn't have one. We learned on the job, which is fine but it was expensive learning.</p>
<blockquote>
<p>Before committing to a streaming architecture, ask: does your team have the skills to maintain it? And if not, what's the cost of developing those skills or hiring them?</p>
</blockquote>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="so-when-is-streaming-actually-worth-it">So When Is Streaming Actually Worth It?<a href="https://www.recodehive.com/blog/batch-vs-stream-processing#so-when-is-streaming-actually-worth-it" class="hash-link" aria-label="Direct link to So When Is Streaming Actually Worth It?" title="Direct link to So When Is Streaming Actually Worth It?" translate="no">​</a></h2>
<p>None of this means streaming is wrong. It means streaming has a real cost that should be weighed against a real business need.</p>
<p>Streaming is worth it when the business problem <strong>genuinely cannot tolerate batch latency.</strong> Here's a clear test:</p>
<p><strong>Reach for streaming when:</strong></p>
<ul>
<li class="">Fraud needs to be detected <strong>before</strong> a transaction completes — batch latency means the fraud already happened</li>
<li class="">A customer's app needs to reflect a change <strong>within seconds</strong> of it occurring</li>
<li class="">A system needs to <strong>react</strong> to an event automatically — alerts, triggers, automated responses</li>
<li class="">You're processing IoT sensor data where stale readings are dangerous, not just inconvenient</li>
</ul>
<p><strong>Stick with batch when:</strong></p>
<ul>
<li class="">You're building monthly reports, financial summaries, or historical analyses</li>
<li class="">Your stakeholders check dashboards in the morning, not the second</li>
<li class="">Your transformations involve complex aggregations over large historical datasets</li>
<li class="">Your team is small and operational simplicity matters more than latency</li>
</ul>
<p>The tech industry is currently obsessed with "real-time," which has led many organizations to over-engineer their stacks implementing complex stream-processing frameworks where a simple batch job would have sufficed. A well-built batch pipeline is more reliable, cheaper, and easier to maintain than a poorly-justified streaming one.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-architecture-that-actually-works-both">The Architecture That Actually Works: Both<a href="https://www.recodehive.com/blog/batch-vs-stream-processing#the-architecture-that-actually-works-both" class="hash-link" aria-label="Direct link to The Architecture That Actually Works: Both" title="Direct link to The Architecture That Actually Works: Both" translate="no">​</a></h2>
<p>Here's what I'd tell myself before starting that project:</p>
<p><strong>You probably need both, not either/or.</strong></p>
<p>Our final architecture uses batch for everything that can tolerate it, and streaming only for the specific cases that genuinely can't:</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">Streaming layer (Kafka + Flink):</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    Order events → real-time status updates (Cassandra)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    Fraud signals → real-time alerts (notification service)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">Batch layer (Spark + ADF):</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    Nightly order aggregations → Silver → Gold (Power BI)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    Monthly revenue reports (finance team)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    ML training datasets (data science team)</span><br></div></code></pre></div></div>
<p><img decoding="async" loading="lazy" alt="Side-by-side architecture diagram showing batch and streaming layers working together. Streaming layer on top handles real-time events via Kafka + Flink into Cassandra. Batch layer below handles nightly Spark jobs into ADLS Gen2 Silver and Gold. Both layers feed into the same OneLake." src="https://www.recodehive.com/assets/images/batch-streaming-combined-architecture-ab0fb2c023be034ec20ccfe41d7ba4bc.png" width="1672" height="941" class="img_ev3q"></p>
<p>The streaming layer handles the 5% of use cases where seconds matter. The batch layer handles the 95% where they don't , more reliably, more cheaply, with less operational overhead.</p>
<p><a href="https://www.recodehive.com/blog/microsoft-fabric-explained" target="_blank" rel="noopener noreferrer" class="">Microsoft Fabric</a> is built around exactly this pattern, Eventstreams for real-time ingestion, ADF Pipelines and Spark Notebooks for batch transformation, both writing to the same OneLake. You don't have to choose one architecture. You choose the right tool for each use case within the same platform.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-honest-summary">The Honest Summary<a href="https://www.recodehive.com/blog/batch-vs-stream-processing#the-honest-summary" class="hash-link" aria-label="Direct link to The Honest Summary" title="Direct link to The Honest Summary" translate="no">​</a></h2>
<table><thead><tr><th></th><th>Batch</th><th>Streaming</th></tr></thead><tbody><tr><td><strong>Infrastructure cost</strong></td><td>Low - runs on schedule</td><td>High - always on</td></tr><tr><td><strong>Latency</strong></td><td>Minutes to hours</td><td>Milliseconds to seconds</td></tr><tr><td><strong>Late data</strong></td><td>Not a problem</td><td>Significant engineering challenge</td></tr><tr><td><strong>Failure recovery</strong></td><td>Fix and rerun</td><td>Complex - risk of duplicates or data loss</td></tr><tr><td><strong>Testing</strong></td><td>Straightforward</td><td>Requires stream simulation</td></tr><tr><td><strong>Team skills needed</strong></td><td>Spark, SQL, ADF</td><td>Kafka, Flink, state management</td></tr><tr><td><strong>Best for</strong></td><td>Analytics, reporting, ML</td><td>Fraud detection, live status, alerts</td></tr><tr><td><strong>Operational complexity</strong></td><td>Low</td><td>High</td></tr></tbody></table>
<p>Streaming pipelines are powerful. They enable product experiences that batch simply can't deliver.</p>
<p>But they come with real costs - infrastructure that never sleeps, late-data handling that never stops being tricky, failure recovery that's genuinely hard to get right, and a skills requirement that's easy to underestimate.</p>
<p>The next time someone on your team says "we should make this real time", ask the question first:</p>
<p><strong>How long can the business actually wait for this data?</strong></p>
<p>If the honest answer is "overnight is fine" — keep the batch job. It's not boring. It's the right call.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="references--further-reading">References &amp; Further Reading<a href="https://www.recodehive.com/blog/batch-vs-stream-processing#references--further-reading" class="hash-link" aria-label="Direct link to References &amp; Further Reading" title="Direct link to References &amp; Further Reading" translate="no">​</a></h2>
<ul>
<li class=""><a href="https://docs.databricks.com/aws/en/data-engineering/batch-vs-streaming" target="_blank" rel="noopener noreferrer" class="">Databricks - Batch vs Streaming</a></li>
<li class=""><a href="https://nightlies.apache.org/flink/flink-docs-stable/docs/concepts/time/" target="_blank" rel="noopener noreferrer" class="">Apache Flink - Watermarks and Late Data</a></li>
<li class=""><a href="https://kafka.apache.org/documentation/" target="_blank" rel="noopener noreferrer" class="">Apache Kafka Documentation</a></li>
<li class=""><a href="https://learn.microsoft.com/en-us/fabric/real-time-intelligence/overview" target="_blank" rel="noopener noreferrer" class="">Microsoft Fabric - Real-Time Intelligence</a></li>
<li class=""><a href="https://www.recodehive.com/blog/netflix-data-engineering" target="_blank" rel="noopener noreferrer" class="">RecodeHive - How Netflix Handles Millions of Events Every Minute</a></li>
<li class=""><a href="https://www.recodehive.com/blog/medallion-architecture" target="_blank" rel="noopener noreferrer" class="">RecodeHive - Medallion Architecture Explained</a></li>
<li class=""><a href="https://www.recodehive.com/blog/microsoft-fabric-explained" target="_blank" rel="noopener noreferrer" class="">RecodeHive - Microsoft Fabric: One Platform, One Lake</a></li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="about-the-author">About the Author<a href="https://www.recodehive.com/blog/batch-vs-stream-processing#about-the-author" class="hash-link" aria-label="Direct link to About the Author" title="Direct link to About the Author" translate="no">​</a></h2>
<p>I'm <strong>Aditya Singh Rathore</strong>, a Data Engineer passionate about building modern, scalable data platforms. I write about data engineering, Azure, and real-world pipeline design on <a href="https://www.recodehive.com/" target="_blank" rel="noopener noreferrer" class="">RecodeHive</a>, turning hard-won lessons into content anyone can learn from.</p>
<p>🔗 <a href="https://www.linkedin.com/in/aditya-singh-rathore0017/" target="_blank" rel="noopener noreferrer" class="">LinkedIn</a> | <a href="https://github.com/Adez017" target="_blank" rel="noopener noreferrer" class="">GitHub</a></p>
<p>📩 Have you been burned by a streaming pipeline that didn't need to be? Drop it in the comments.</p>
<div></div>]]></content:encoded>
            <author>rathoreadityasingh30@gmail.com (Aditya Singh Rathore)</author>
            <category>batch-processing</category>
            <category>stream-processing</category>
            <category>data-engineering</category>
            <category>apache-kafka</category>
            <category>apache-flink</category>
            <category>apache-spark</category>
            <category>data-pipeline</category>
            <category>real-time</category>
            <category>azure</category>
            <category>medallion-architecture</category>
            <category>data-architecture</category>
        </item>
        <item>
            <title><![CDATA[How Netflix Handles 2 Trillion Events Every Day]]></title>
            <link>https://www.recodehive.com/blog/netflix-data-engineering</link>
            <guid>https://www.recodehive.com/blog/netflix-data-engineering</guid>
            <pubDate>Mon, 04 May 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Every click, pause, search, and scroll on Netflix generates an event. With 300 million subscribers across 190 countries, Netflix processes 2 trillion events every single day through a pipeline called Keystone. Here's a deep dive into how Kafka, Flink, Cassandra, and Iceberg make it all work in real time.]]></description>
            <content:encoded><![CDATA[<p>Right now, someone is pausing Stranger Things at the exact moment a jump scare hits.</p>
<p>Someone else just searched "action movies" and clicked the third result. Another person skipped the intro of a show they've watched five times. And somewhere, a user on a slow connection just had their video quality automatically drop from 4K to 1080p, without any buffering, without any prompt.</p>
<p>Every single one of these actions is an <strong>event</strong>. And Netflix captures all of them from 300 million subscribers across 190 countries, continuously, in real time.</p>
<p>The scale: <strong>2 trillion events every single day.</strong> That's 3 petabytes of data ingested, 7 petabytes output, at a peak rate of 12.5 million events per second. The system behind all of this is called <strong>Keystone</strong> - Netflix's internal real-time data pipeline, and understanding how it works is one of the most instructive case studies in modern data engineering.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-scale-problem-why-this-is-actually-hard">The Scale Problem: Why This Is Actually Hard<a href="https://www.recodehive.com/blog/netflix-data-engineering#the-scale-problem-why-this-is-actually-hard" class="hash-link" aria-label="Direct link to The Scale Problem: Why This Is Actually Hard" title="Direct link to The Scale Problem: Why This Is Actually Hard" translate="no">​</a></h2>
<p>Most people assume Netflix's hard problem is streaming video. It's not. The hard problem is streaming <em>data about</em> video.</p>
<p>Every time you interact with Netflix, dozens of microservices each emit their own events simultaneously. A single "press play" triggers events from the playback service, the recommendation service, the quality-monitoring service, the CDN routing service, and more, all at the same time. Now multiply that by 300 million concurrent users across different time zones.</p>
<p>Before Keystone, Netflix ran a batch pipeline built on Chukwa, Hadoop, and Hive. By 2015, logging volume had grown to 500 billion events per day and the system was collapsing. Netflix estimated they had <strong>six months</strong> to rebuild it as a streaming-first architecture before it failed completely under subscriber growth.</p>
<p>That pressure is why every architectural decision in Keystone was made under real production constraints not theoretical design.</p>
<p><img decoding="async" loading="lazy" alt="Netflix data infrastructure scale — 2 trillion events per day, 3PB ingested, 7PB output" src="https://www.recodehive.com/assets/images/architecture-b3efd98872b2ada340c6b0d72e894f38.png" width="1400" height="477" class="img_ev3q">
<em>Keystone processes 2 trillion events/day — 3PB ingested, 7PB output daily. Source: Netflix Engineering</em></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-is-an-event-exactly">What Is an Event, Exactly?<a href="https://www.recodehive.com/blog/netflix-data-engineering#what-is-an-event-exactly" class="hash-link" aria-label="Direct link to What Is an Event, Exactly?" title="Direct link to What Is an Event, Exactly?" translate="no">​</a></h2>
<p>An event is a small structured record, typically a few kilobytes that captures a single thing that happened. Every event at Netflix carries a consistent set of core fields:</p>
<div class="language-json codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-json codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"event_id"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain">   </span><span class="token string" style="color:#e3116c">"uuid-1234-abcd"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"event_type"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"play_start"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"user_id"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain">    </span><span class="token string" style="color:#e3116c">"u_98765432"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"device_id"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain">  </span><span class="token string" style="color:#e3116c">"d_iPhone15"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"title_id"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain">   </span><span class="token string" style="color:#e3116c">"t_StrangerThings_S4E1"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"timestamp"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain">  </span><span class="token string" style="color:#e3116c">"2026-05-04T18:32:11.452Z"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"session_id"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"s_abc123"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"region"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain">     </span><span class="token string" style="color:#e3116c">"IN"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"quality"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain">    </span><span class="token string" style="color:#e3116c">"1080p"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"network"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain">    </span><span class="token string" style="color:#e3116c">"WiFi"</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">}</span><br></div></code></pre></div></div>
<p>Netflix generates hundreds of distinct event types across all its services:</p>
<ul>
<li class=""><code>play_start</code>, <code>play_pause</code>, <code>play_stop</code>, <code>seek</code></li>
<li class=""><code>search_query</code>, <code>search_result_click</code></li>
<li class=""><code>scroll_position</code>, <code>title_hovered</code>, <code>row_impression</code></li>
<li class=""><code>buffer_start</code>, <code>buffer_end</code>, <code>quality_change</code></li>
<li class=""><code>error_occurred</code>, <code>playback_failed</code></li>
<li class=""><code>ab_test_assignment</code>, <code>recommendation_shown</code></li>
</ul>
<p>Each event type has its own schema, its own set of required and optional fields, data types, and validation rules. Managing thousands of schemas across hundreds of microservice teams is itself a major engineering problem. That's exactly what the Schema Registry (covered below) was built to solve.</p>
<p>The event above looks simple. But when you're ingesting 12.5 million of them every second, the engineering required to make that reliable without data loss, without duplicates, without schema corruption is anything but simple.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-architecture-keystone-kafka-and-flink">The Architecture: Keystone, Kafka, and Flink<a href="https://www.recodehive.com/blog/netflix-data-engineering#the-architecture-keystone-kafka-and-flink" class="hash-link" aria-label="Direct link to The Architecture: Keystone, Kafka, and Flink" title="Direct link to The Architecture: Keystone, Kafka, and Flink" translate="no">​</a></h2>
<p>Before diving into individual tools, watch this first. Flink Forward's breakdown gives you the visual mental model that makes the rest of this article click into place:</p>
<iframe width="100%" height="400" src="https://www.youtube.com/embed/lC0d3gAPXaI" title="Netflix Data Engineering with Apache Flink" style="border:none"></iframe>
<hr>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="keystone-the-platform-that-wraps-everything">Keystone: The Platform That Wraps Everything<a href="https://www.recodehive.com/blog/netflix-data-engineering#keystone-the-platform-that-wraps-everything" class="hash-link" aria-label="Direct link to Keystone: The Platform That Wraps Everything" title="Direct link to Keystone: The Platform That Wraps Everything" translate="no">​</a></h3>
<p>Most articles jump straight to Kafka and Flink. But the important thing to understand first is <strong>Keystone :</strong> the internal platform that manages the entire pipeline as a service.</p>
<p>Keystone is not a single open-source tool. It's Netflix's purpose-built <strong>Stream Processing as a Service (SPaaS)</strong> platform built on top of Kafka and Flink. It provides:</p>
<ul>
<li class="">A <strong>Data Pipeline layer</strong>: handles event ingestion, routing, and delivery to all downstream sinks (S3, Elasticsearch, secondary Kafka topics)</li>
<li class="">A <strong>Stream Processing layer</strong>: lets any Netflix engineering team deploy and run custom Flink jobs without managing the underlying infrastructure themselves</li>
<li class="">A <strong>Control Plane</strong>: manages job configuration, deployment via Spinnaker, health monitoring, and self-healing. Every job's desired state is stored in AWS RDS, if a Kafka cluster goes down, it can be fully reconstructed from RDS alone</li>
</ul>
<p>Think of Keystone as the operating system for data at Netflix. Kafka and Flink are the engines. Keystone is the layer that makes them usable, self-service, and reliable across thousands of internal teams.</p>
<blockquote>
<p>📖 <a href="https://netflixtechblog.com/keystone-real-time-stream-processing-platform-a3ee651812a" target="_blank" rel="noopener noreferrer" class="">Keystone Real-time Stream Processing Platform — Netflix Tech Blog</a></p>
</blockquote>
<p>The full pipeline architecture:</p>
<p><img decoding="async" loading="lazy" alt="full pipeline" src="https://www.recodehive.com/assets/images/full-pipeline_architecture-3c3f77018c909376e2e1c1e141abf54e.png" width="3599" height="3575" class="img_ev3q"></p>
<hr>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="layer-1-event-capture-suro-and-the-api-gateway">Layer 1: Event Capture: Suro and the API Gateway<a href="https://www.recodehive.com/blog/netflix-data-engineering#layer-1-event-capture-suro-and-the-api-gateway" class="hash-link" aria-label="Direct link to Layer 1: Event Capture: Suro and the API Gateway" title="Direct link to Layer 1: Event Capture: Suro and the API Gateway" translate="no">​</a></h3>
<p>When a Netflix microservice emits an event, it has two paths into Kafka:
style="border: none;"</p>
<ol>
<li class=""><strong>Direct Kafka write</strong> via a Java client library, for high-throughput services that need maximum speed</li>
<li class=""><strong>HTTP POST via Suro :</strong>  Netflix's internal event collection proxy for services in Python or other languages</li>
</ol>
<p>Both paths end at the same place: a Kafka topic. The critical design principle here is <strong>capture first, process never at the entry point.</strong> The gateway does minimal validation, is the schema registered? does the payload match? and then writes immediately. No enrichment, no business logic, no database calls.</p>
<p>At 12.5 million events per second, even a 1-millisecond database call per event would require 12,500 concurrent database operations per second at the gateway alone. Keeping the entry point stateless is what makes the pipeline scale.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="layer-2-apache-kafka-the-heart-of-the-pipeline">Layer 2: Apache Kafka: The Heart of the Pipeline<a href="https://www.recodehive.com/blog/netflix-data-engineering#layer-2-apache-kafka-the-heart-of-the-pipeline" class="hash-link" aria-label="Direct link to Layer 2: Apache Kafka: The Heart of the Pipeline" title="Direct link to Layer 2: Apache Kafka: The Heart of the Pipeline" translate="no">​</a></h3>
<p><a href="https://kafka.apache.org/" target="_blank" rel="noopener noreferrer" class="">Apache Kafka</a> is the backbone of Keystone. Every event from every microservice flows through Kafka before going anywhere else.</p>
<p><strong>Topic-per-event-type architecture:</strong></p>
<p>Netflix follows a strict rule: <em>one Kafka topic per event type.</em> Hundreds of topics run in parallel — <code>play_events</code>, <code>search_events</code>, <code>error_events</code>, <code>quality_events</code>, and so on. This isolation means a spike in error events during an outage doesn't slow down play event processing, and each topic can have its own retention policy, replication factor, and partition count independently tuned.</p>
<p><strong>Durability profiles:</strong></p>
<p>Netflix configures Kafka with different durability levels depending on how critical the data is. For AP (Availability over Consistency) use cases - analytics events where losing a tiny fraction is acceptable, they allow unclean leader election, trading perfect consistency for never going down. For CP (Consistency over Availability) use cases - billing events, legal audit logs, they require clean leader election with no data loss possible.</p>
<p><strong>Avro + Schema Registry - the data contract:</strong></p>
<p>Every event in Kafka is encoded in <strong>Apache Avro</strong>, a compact binary format that is 3-5x smaller than JSON and significantly faster to parse. But more importantly, every Avro schema is registered in a centralised <strong>Schema Registry</strong> before any event can be written.</p>
<p>When a team deploys a bad change that sends a malformed event - wrong field type, missing required field, Kafka rejects it at the producer. It never enters the pipeline. At 2 trillion events per day, an undetected schema mismatch could corrupt petabytes of downstream data before anyone notices. Schema enforcement at the source is what prevents this.</p>
<blockquote>
<p>📖 <a href="https://www.confluent.io/blog/how-kafka-is-used-by-netflix/" target="_blank" rel="noopener noreferrer" class="">How Netflix Uses Kafka for Distributed Streaming — Confluent</a></p>
</blockquote>
<p><img decoding="async" loading="lazy" alt="Apache Kafka topic architecture showing multiple topics with partitions and parallel consumer groups" src="https://www.recodehive.com/assets/images/kafka_topics-b801d053c0e009cfc030cc40626abbb2.png" width="1536" height="1024" class="img_ev3q">
<em>Kafka organises events into topics with partitions — parallel consumption by multiple downstream systems simultaneously. Source: Conduktor</em></p>
<p><strong>Retention and replay:</strong></p>
<p>Kafka doesn't store events forever. Netflix sets retention policies per topic, high-volume topics might retain data for hours, lower-volume ones for days. The safety net: all Kafka records are also persisted to <strong>Apache Iceberg</strong> tables on S3. If a downstream Flink job fails and needs to reprocess events that have already expired from Kafka, it reads from Iceberg instead. The pipeline is fully replayable.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="layer-3---apache-flink-where-raw-events-become-useful-data">Layer 3 - Apache Flink: Where Raw Events Become Useful Data<a href="https://www.recodehive.com/blog/netflix-data-engineering#layer-3---apache-flink-where-raw-events-become-useful-data" class="hash-link" aria-label="Direct link to Layer 3 - Apache Flink: Where Raw Events Become Useful Data" title="Direct link to Layer 3 - Apache Flink: Where Raw Events Become Useful Data" translate="no">​</a></h3>
<p>Kafka stores and delivers events reliably. But events in a queue don't power recommendations or dashboards. They need to be processed and that's <a href="https://flink.apache.org/" target="_blank" rel="noopener noreferrer" class="">Apache Flink</a>'s job.</p>
<p>Flink jobs run continuously, 24/7, consuming from Kafka topics in near real time. A typical Flink job in Keystone runs this chain of operations:</p>
<p><strong>Filter →</strong> Remove noise: system health pings, internal test events, bot traffic, malformed records that slipped past schema validation.</p>
<p><strong>Enrich →</strong> A raw <code>play_start</code> event only contains <code>user_id</code>, <code>title_id</code>, and <code>timestamp</code>. Downstream systems need the show's genre, the user's country, the content rating. Flink enriches events by joining with <strong>side inputs</strong>, a small reference datasets loaded into Flink task memory, so enrichment happens locally without any network calls.</p>
<p><strong>Deduplicate →</strong> Devices retry failed requests. The same event can arrive in Kafka twice. Flink maintains a short time-window buffer in <strong>RocksDB</strong> (an embedded key-value store local to each Flink task), comparing event IDs and dropping duplicates before they reach storage.</p>
<p><strong>Transform →</strong> Reshape the enriched event into the exact schema that each downstream storage system expects.</p>
<p><strong>Window →</strong> Aggregate events across time. <em>"Count all <code>play_start</code> events in the last 60 seconds, grouped by country and device type."</em> This is how Netflix's real-time operations dashboards get live numbers updated every minute.</p>
<p><strong>The 1:1 lesson Netflix learned the hard way:</strong></p>
<p>Netflix initially tried one monolithic Flink job consuming all Kafka topics. It was a disaster. Different topics have wildly different volumes and burst patterns, play events spike on Friday evenings, error events spike during CDN outages making it impossible to tune a single job for all of them without constant instability.</p>
<p>Their solution: <strong>one dedicated Flink job per Kafka topic.</strong> More jobs to operate, but each can be independently scaled, monitored, and tuned. A problem in the <code>error_events</code> Flink job doesn't affect the <code>play_events</code> Flink job. This is a real architectural lesson: operational simplicity at the individual job level outweighs the overhead of managing more jobs.</p>
<blockquote>
<p>📖 <a href="https://www.infoq.com/articles/netflix-migrating-stream-processing/" target="_blank" rel="noopener noreferrer" class="">Migrating Batch ETL to Stream Processing at Netflix — InfoQ</a></p>
</blockquote>
<p><img decoding="async" loading="lazy" src="https://nightlies.apache.org/flink/flink-docs-release-1.17/fig/program_dataflow.svg" alt="Apache Flink dataflow diagram showing a Kafka source feeding into filter, enrich, and transform operators writing to Cassandra and S3" class="img_ev3q">
<em>A Flink job pipeline: events enter from Kafka, flow through processing operators, and are written to storage sinks. Source: Apache Flink Docs</em></p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="layer-4---storage-three-databases-three-jobs">Layer 4 - Storage: Three Databases, Three Jobs<a href="https://www.recodehive.com/blog/netflix-data-engineering#layer-4---storage-three-databases-three-jobs" class="hash-link" aria-label="Direct link to Layer 4 - Storage: Three Databases, Three Jobs" title="Direct link to Layer 4 - Storage: Three Databases, Three Jobs" translate="no">​</a></h3>
<p>Processed events are routed to three different storage systems depending on how they'll be accessed:</p>
<p><strong>Apache Cassandra - for millisecond reads at scale:</strong>
Powers anything that needs to be fast, your Continue Watching row, personalised home screen, real-time recommendation updates. Cassandra is a distributed NoSQL database with no single point of failure, designed for massive write throughput. Netflix's Cassandra deployment spans thousands of nodes across multiple clusters and scales linearly.</p>
<p><strong>Apache Iceberg on S3 - for analytical queries:</strong>
Long-term storage for ML model training, A/B test analysis, and content strategy decisions. Iceberg adds ACID transactions, time travel, and schema evolution on top of cheap object storage. The same data that flowed through Kafka and Flink in real time is also persisted here for batch processing. It's also the replay source when Kafka retention expires.</p>
<blockquote>
<p>📖 <a href="https://iceberg.apache.org/" target="_blank" rel="noopener noreferrer" class="">Apache Iceberg — the open table format</a></p>
</blockquote>
<p><strong>Elasticsearch - for observability:</strong>
Operational events, errors, latency spikes, quality degradations are indexed here and power Netflix's internal engineering dashboards. When an on-call engineer needs to know "how many buffering events happened in the last 5 minutes in Southeast Asia," they're querying Elasticsearch.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="connecting-the-tech-to-real-ux">Connecting the Tech to Real UX<a href="https://www.recodehive.com/blog/netflix-data-engineering#connecting-the-tech-to-real-ux" class="hash-link" aria-label="Direct link to Connecting the Tech to Real UX" title="Direct link to Connecting the Tech to Real UX" translate="no">​</a></h2>
<p>Here's what all of this actually produces for a real Netflix user:</p>
<p><strong>Your home screen is personalised in near real time.</strong> Every show you watch, every row you scroll past, every search you run — these events flow through Keystone within seconds and update your taste profile in Cassandra. The next time you open Netflix, the home screen reflects what you did in the last hour, not just your all-time history.</p>
<p><strong>Thumbnails change based on what works for you personally.</strong> Netflix runs thousands of A/B thumbnail tests simultaneously. The event pipeline tracks which thumbnails led to a play and which were ignored and automatically serves the winning variant to users with similar taste profiles. All measured through events.</p>
<p><strong>Video quality adjusts seamlessly before you notice.</strong> Quality-change events flow through Kafka and Flink in milliseconds. When Netflix detects your connection degrading, the pipeline routes a signal to the playback service before your buffer empties. You never see a spinner.</p>
<p><strong>Content decisions are driven by event data.</strong> Which shows do people abandon after episode 1? Which genres drive subscription upgrades in specific markets? This runs as Spark batch jobs on Iceberg tables, billions of events informing which content Netflix commissions and licenses next.</p>
<p><img decoding="async" loading="lazy" alt="Netflix home screen showing personalised rows powered by real-time event pipeline - Top Picks, Continue Watching, Trending Now" src="https://www.recodehive.com/assets/images/homescreen-709db9e4ef2f8e0c475684febe242ca4.png" width="1920" height="1193" class="img_ev3q">
<em>Every row on your home screen — Top Picks, Continue Watching, Trending — is powered by events processed through Keystone in near real time. Source: Netflix</em></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="5-lessons-for-your-own-data-pipeline">5 Lessons for Your Own Data Pipeline<a href="https://www.recodehive.com/blog/netflix-data-engineering#5-lessons-for-your-own-data-pipeline" class="hash-link" aria-label="Direct link to 5 Lessons for Your Own Data Pipeline" title="Direct link to 5 Lessons for Your Own Data Pipeline" translate="no">​</a></h2>
<p>Netflix's pipeline wasn't built in a day, it evolved through failures, rewrites, and hard-won production lessons over more than a decade. Here are five principles every data engineer can apply at any scale:</p>
<p><strong>1. Capture first, process never at ingestion.</strong>
Your event collection layer should do one thing: receive events and write them to a durable queue. No enrichment, no business logic, no database calls at the entry point. Anything you add there compounds into a bottleneck at scale. Keep ingestion stateless and fast.</p>
<p><strong>2. Schema enforcement is your safety net, invest early.</strong>
At any meaningful scale, a single bad deploy can silently corrupt your entire pipeline without schema validation. Invest in a Schema Registry before you need it. Avro or Protobuf with centralised validation means malformed events are rejected at the source, not discovered days later in broken downstream tables when the damage is already done.</p>
<p><strong>3. One job per topic beats one monolith for all topics.</strong>
If you're using Flink or Spark Streaming, resist the temptation to build one big job that handles everything. Separate topics have different volumes, burst patterns, and latency requirements. A dedicated job per topic means you can tune, scale, monitor, and fix each independently and a failure in one doesn't cascade to others.</p>
<p><strong>4. Match storage to access pattern, not convenience.</strong>
Cassandra for millisecond point reads. Iceberg or Delta Lake for analytical queries over billions of rows. Elasticsearch for full-text and observability queries. These are not interchangeable. The most common mistake is picking one database for everything and then wondering why queries are slow. Design your storage tier around query patterns first.</p>
<p><strong>5. Build for replay from day one.</strong>
Pipelines fail. Jobs crash. Kafka topics expire. If you can't reprocess historical events, every failure is permanent data loss. Before you ship your first pipeline, answer: <em>if this job needs to reprocess last week's data tomorrow, where does it read from?</em> Netflix answers this with Iceberg as the replay source. You need your own answer before you go live.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-numbers-in-context">The Numbers, In Context<a href="https://www.recodehive.com/blog/netflix-data-engineering#the-numbers-in-context" class="hash-link" aria-label="Direct link to The Numbers, In Context" title="Direct link to The Numbers, In Context" translate="no">​</a></h2>
<table><thead><tr><th>Metric</th><th>Value</th></tr></thead><tbody><tr><td>Daily events processed</td><td>2 trillion</td></tr><tr><td>Data ingested per day</td><td>3 petabytes</td></tr><tr><td>Data output per day</td><td>7 petabytes</td></tr><tr><td>Peak throughput</td><td>12.5 million events/second</td></tr><tr><td>Subscribers generating events</td><td>300M+ across 190 countries</td></tr><tr><td>Kafka topics</td><td>Hundreds, one per event type</td></tr></tbody></table>
<p>Every number here represents a real engineering constraint that forced a specific architectural choice. The scale is impressive. The principles behind it are what actually matter.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="wrapping-up">Wrapping Up<a href="https://www.recodehive.com/blog/netflix-data-engineering#wrapping-up" class="hash-link" aria-label="Direct link to Wrapping Up" title="Direct link to Wrapping Up" translate="no">​</a></h2>
<p>The next time Netflix recommends something that feels uncomfortably accurate, or your video quality silently adjusts on a slow connection, or your Continue Watching row picks up exactly where you left off on a different device, that's 2 trillion events per day, flowing through Keystone, processed by Flink, stored in Cassandra and Iceberg, translating raw user actions into a product experience that feels effortless.</p>
<p>The pipeline is invisible. That's exactly the point.</p>
<p>For data engineers, the real takeaway isn't the scale. It's the principles. Capture fast. Enforce schemas. Separate concerns. Match storage to access patterns. Build for replay. These apply whether you're handling 2 trillion events or 2 thousand.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="references--further-reading">References &amp; Further Reading<a href="https://www.recodehive.com/blog/netflix-data-engineering#references--further-reading" class="hash-link" aria-label="Direct link to References &amp; Further Reading" title="Direct link to References &amp; Further Reading" translate="no">​</a></h2>
<ul>
<li class=""><a href="https://netflixtechblog.com/keystone-real-time-stream-processing-platform-a3ee651812a" target="_blank" rel="noopener noreferrer" class="">Keystone Real-time Stream Processing Platform — Netflix Tech Blog</a></li>
<li class=""><a href="https://netflixtechblog.com/how-and-why-netflix-built-a-real-time-distributed-graph-part-1-ingesting-and-processing-data-80113e124acc" target="_blank" rel="noopener noreferrer" class="">How Netflix Built a Real-Time Distributed Graph — Netflix Tech Blog</a></li>
<li class=""><a href="https://www.confluent.io/blog/how-kafka-is-used-by-netflix/" target="_blank" rel="noopener noreferrer" class="">How Netflix Uses Kafka for Distributed Streaming — Confluent</a></li>
<li class=""><a href="https://www.infoq.com/articles/netflix-migrating-stream-processing/" target="_blank" rel="noopener noreferrer" class="">Migrating Batch ETL to Stream Processing at Netflix — InfoQ</a></li>
<li class=""><a href="https://zhenzhongxu.com/the-four-innovation-phases-of-netflixs-trillions-scale-real-time-data-infrastructure-2370938d7f01" target="_blank" rel="noopener noreferrer" class="">The Four Innovation Phases of Netflix's Trillions Scale Data Infrastructure — Medium</a></li>
<li class=""><a href="https://quickbooks-engineering.intuit.com/lessons-learnt-from-netflix-keystone-pipeline-with-trillions-of-daily-messages-64cc91b3c8ea" target="_blank" rel="noopener noreferrer" class="">Lessons Learned from Netflix Keystone Pipeline — Intuit Engineering</a></li>
<li class=""><a href="https://kafka.apache.org/documentation/" target="_blank" rel="noopener noreferrer" class="">Apache Kafka Documentation</a></li>
<li class=""><a href="https://flink.apache.org/" target="_blank" rel="noopener noreferrer" class="">Apache Flink Documentation</a></li>
<li class=""><a href="https://iceberg.apache.org/" target="_blank" rel="noopener noreferrer" class="">Apache Iceberg Documentation</a></li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="about-the-author">About the Author<a href="https://www.recodehive.com/blog/netflix-data-engineering#about-the-author" class="hash-link" aria-label="Direct link to About the Author" title="Direct link to About the Author" translate="no">​</a></h2>
<p>I'm <strong>Aditya Singh Rathore</strong>, a Data Engineer passionate about building modern, scalable data platforms. I write about data engineering, system design, and real-world architectures on <a href="https://www.recodehive.com/" target="_blank" rel="noopener noreferrer" class="">RecodeHive</a>, breaking down complex systems into concepts anyone can learn from.</p>
<p>🔗 <a href="https://www.linkedin.com/in/aditya-singh-rathore0017/" target="_blank" rel="noopener noreferrer" class="">LinkedIn</a> | <a href="https://github.com/Adez017" target="_blank" rel="noopener noreferrer" class="">GitHub</a></p>
<p>📩 Building a real-time pipeline? Drop your questions in the comments below.</p>
<div></div>]]></content:encoded>
            <author>rathoreadityasingh30@gmail.com (Aditya Singh Rathore)</author>
            <category>netflix</category>
            <category>data-engineering</category>
            <category>kafka</category>
            <category>apache-flink</category>
            <category>real-time</category>
            <category>event-streaming</category>
            <category>data-pipeline</category>
            <category>cassandra</category>
            <category>avro</category>
            <category>iceberg</category>
            <category>keystone</category>
        </item>
        <item>
            <title><![CDATA[How SSO Works - Case Study]]></title>
            <link>https://www.recodehive.com/blog/single-sign-on</link>
            <guid>https://www.recodehive.com/blog/single-sign-on</guid>
            <pubDate>Mon, 04 May 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[SSO lets you log into dozens of apps with a single set of credentials. But how does it actually work under the hood? This beginner-friendly guide walks through the full flow — from clicking "Sign in with Google" to getting access — step by step.]]></description>
            <content:encoded><![CDATA[<p>You've done this a hundred times without thinking about it.</p>
<p>You land on a website, maybe LinkedIn, maybe Spotify, maybe some random productivity app and instead of creating yet another account with yet another password, you just click <strong>"Sign in with Google."</strong></p>
<p>Two seconds later, you're in.</p>
<p>No new password. No verification email. No "must contain one uppercase, one number, and the soul of a forgotten god." Just... in.</p>
<p>That's <strong>Single Sign-On (SSO)</strong> at work. And once you understand how it actually works under the hood, you'll see it everywhere.</p>
<p><img decoding="async" loading="lazy" alt="SSO Flow" src="https://www.recodehive.com/assets/images/SSO-18cf05a68856cf3a48376083df9dee91.png" width="1536" height="1024" class="img_ev3q"></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-master-key-analogy">The Master Key Analogy<a href="https://www.recodehive.com/blog/single-sign-on#the-master-key-analogy" class="hash-link" aria-label="Direct link to The Master Key Analogy" title="Direct link to The Master Key Analogy" translate="no">​</a></h2>
<p>Think of SSO like a master key for a hotel.</p>
<p>Every room in the hotel has its own lock - the gym, the pool, the restaurant, your room on the 7th floor. Normally, you'd need a separate key for each one. That would be exhausting.</p>
<p>Instead, the front desk gives you one key card when you check in. That single card opens every door you're allowed through, for the entire stay.</p>
<p>SSO works the same way. You prove who you are once. Everything else just opens.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="two-characters-you-need-to-know">Two Characters You Need to Know<a href="https://www.recodehive.com/blog/single-sign-on#two-characters-you-need-to-know" class="hash-link" aria-label="Direct link to Two Characters You Need to Know" title="Direct link to Two Characters You Need to Know" translate="no">​</a></h2>
<p>Before we walk through the login flow, meet the two players involved:</p>
<ul>
<li class="">
<ol>
<li class=""><strong>Identity Provider (IdP)</strong> - This is the entity that <em>knows who you are</em>. Google, Microsoft, Apple - these are common Identity Providers. They hold your credentials and vouch for your identity.</li>
</ol>
</li>
<li class="">
<ol start="2">
<li class=""><strong>Service Provider (SP)</strong> - This is the app or website you're actually trying to use. LinkedIn, GitHub, Notion, Slack - these are Service Providers. They don't store your password. They just trust the Identity Provider's word.</li>
</ol>
</li>
</ul>
<p>The whole dance of SSO happens between these two.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="how-it-actually-works-step-by-step">How It Actually Works: Step by Step<a href="https://www.recodehive.com/blog/single-sign-on#how-it-actually-works-step-by-step" class="hash-link" aria-label="Direct link to How It Actually Works: Step by Step" title="Direct link to How It Actually Works: Step by Step" translate="no">​</a></h2>
<p>Let's walk through a real example - logging into LinkedIn using Google.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="step-1---you-knock-on-the-door">Step 1 - You knock on the door<a href="https://www.recodehive.com/blog/single-sign-on#step-1---you-knock-on-the-door" class="hash-link" aria-label="Direct link to Step 1 - You knock on the door" title="Direct link to Step 1 - You knock on the door" translate="no">​</a></h3>
<p>You visit LinkedIn and click <strong>"Sign in with Google."</strong></p>
<p>LinkedIn (the Service Provider) doesn't ask for your password. Instead, it says: <em>"I don't know this person. Let me send them to Google."</em></p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="step-2---linkedin-redirects-you-to-google">Step 2 - LinkedIn redirects you to Google<a href="https://www.recodehive.com/blog/single-sign-on#step-2---linkedin-redirects-you-to-google" class="hash-link" aria-label="Direct link to Step 2 - LinkedIn redirects you to Google" title="Direct link to Step 2 - LinkedIn redirects you to Google" translate="no">​</a></h3>
<p>LinkedIn sends you over to Google with an authentication request — essentially a note that says: <em>"Hey Google, can you confirm who this person is?"</em></p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="step-3---google-checks-if-youre-already-logged-in">Step 3 - Google checks if you're already logged in<a href="https://www.recodehive.com/blog/single-sign-on#step-3---google-checks-if-youre-already-logged-in" class="hash-link" aria-label="Direct link to Step 3 - Google checks if you're already logged in" title="Direct link to Step 3 - Google checks if you're already logged in" translate="no">​</a></h3>
<p>Google (the Identity Provider) looks for an active session on your browser.</p>
<ul>
<li class=""><strong>If you're already logged into Google</strong> → it skips straight to step 6. No password needed.</li>
<li class=""><strong>If you're not logged in</strong> → it asks for your credentials.</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="step-4---you-enter-your-google-credentials">Step 4 - You enter your Google credentials<a href="https://www.recodehive.com/blog/single-sign-on#step-4---you-enter-your-google-credentials" class="hash-link" aria-label="Direct link to Step 4 - You enter your Google credentials" title="Direct link to Step 4 - You enter your Google credentials" translate="no">​</a></h3>
<p>You type in your Google email and password. This is the <em>only</em> place your credentials go. LinkedIn never sees them. Ever.</p>
<p>This is actually one of the biggest security wins of SSO — your password lives in one place, with one trusted provider, instead of being scattered across dozens of apps.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="step-5---google-verifies-who-you-are">Step 5 - Google verifies who you are<a href="https://www.recodehive.com/blog/single-sign-on#step-5---google-verifies-who-you-are" class="hash-link" aria-label="Direct link to Step 5 - Google verifies who you are" title="Direct link to Step 5 - Google verifies who you are" translate="no">​</a></h3>
<p>Google checks your credentials against its own database. If everything matches, it doesn't just let you in — it creates something called an <strong>authentication token</strong> (think of it as a signed, digital stamp of approval).</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="step-6---google-sends-that-token-back-to-linkedin">Step 6 - Google sends that token back to LinkedIn<a href="https://www.recodehive.com/blog/single-sign-on#step-6---google-sends-that-token-back-to-linkedin" class="hash-link" aria-label="Direct link to Step 6 - Google sends that token back to LinkedIn" title="Direct link to Step 6 - Google sends that token back to LinkedIn" translate="no">​</a></h3>
<p>Google hands the token to LinkedIn. The token essentially says: <em>"This person is who they say they are. I, Google, can confirm it."</em></p>
<p>LinkedIn trusts Google's word, reads the token, and lets you in — without ever having touched your password.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="step-7---the-magic-of-the-existing-session">Step 7 - The magic of the existing session<a href="https://www.recodehive.com/blog/single-sign-on#step-7---the-magic-of-the-existing-session" class="hash-link" aria-label="Direct link to Step 7 - The magic of the existing session" title="Direct link to Step 7 - The magic of the existing session" translate="no">​</a></h3>
<p>Here's where SSO really earns its name.</p>
<p>Later that day, you open GitHub and click "Sign in with Google." GitHub sends the same authentication request to Google. But this time, Google already has an active session from when you logged into LinkedIn.</p>
<p>So instead of asking for your password again, Google just says: <em>"Yep, I know this person. Here's their token."</em></p>
<p>You're in GitHub instantly. No password. No friction.</p>
<p>One login. Many doors.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-protocols-behind-the-scenes">The Protocols Behind the Scenes<a href="https://www.recodehive.com/blog/single-sign-on#the-protocols-behind-the-scenes" class="hash-link" aria-label="Direct link to The Protocols Behind the Scenes" title="Direct link to The Protocols Behind the Scenes" translate="no">​</a></h2>
<p>SSO isn't magic - it runs on a set of agreed-upon rules that tell the Identity Provider and Service Provider how to talk to each other and how to trust each other. These rules are called <strong>protocols</strong>.</p>
<p>The three most common ones you'll hear about:</p>
<p><strong>SAML (Security Assertion Markup Language)</strong> - the older, enterprise-friendly protocol. You'll find it in corporate SSO setups, think logging into your company's internal tools with your work email.</p>
<p><strong>OpenID Connect</strong> - the modern, developer-friendly protocol built on top of OAuth. This is what powers most "Sign in with Google" buttons you see on consumer apps today.</p>
<p><strong>OAuth</strong> - technically an authorization protocol (not authentication), but often used alongside OpenID Connect. It's what handles the "allow this app to access your Google account" permissions screen.</p>
<p>You don't need to memorize the differences right now. Just know that when SSO works smoothly, one of these protocols is doing the heavy lifting in the background.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-does-any-of-this-matter">Why Does Any of This Matter?<a href="https://www.recodehive.com/blog/single-sign-on#why-does-any-of-this-matter" class="hash-link" aria-label="Direct link to Why Does Any of This Matter?" title="Direct link to Why Does Any of This Matter?" translate="no">​</a></h2>
<p>SSO isn't just a convenience feature. It solves real problems:</p>
<ul>
<li class="">
<ol>
<li class=""><strong>For users:</strong> Fewer passwords to remember means fewer weak passwords, fewer forgotten passwords, and fewer "reset my password" spirals at 11pm.</li>
</ol>
</li>
<li class="">
<ol start="2">
<li class=""><strong>For security teams:</strong> When an employee leaves a company, revoking access to one Identity Provider cuts off access to every connected app instantly — instead of hunting down 30 individual accounts.</li>
</ol>
</li>
<li class="">
<ol start="3">
<li class=""><strong>For developers:</strong> Building an app with SSO means you don't have to manage password storage, reset flows, or authentication security yourself. You offload all of that to a provider like Google or Microsoft that is very, very good at it.</li>
</ol>
</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-one-thing-to-remember">The One Thing to Remember<a href="https://www.recodehive.com/blog/single-sign-on#the-one-thing-to-remember" class="hash-link" aria-label="Direct link to The One Thing to Remember" title="Direct link to The One Thing to Remember" translate="no">​</a></h2>
<p>If you take nothing else from this:</p>
<blockquote>
<p><strong>SSO means you prove your identity once, to one trusted provider, and that proof travels with you across every connected app.</strong></p>
</blockquote>
<p>Next time you click "Sign in with Google," you'll know exactly what's happening behind that button — a quiet handshake between two systems, so you don't have to think about it at all.</p>
<p><em>Enjoyed this? I write about data engineering, system design, and the concepts that actually matter in tech — without the jargon.</em></p>
<p>🔗 <a href="https://www.linkedin.com/in/aditya-singh-rathore0017/" target="_blank" rel="noopener noreferrer" class="">LinkedIn</a> | <a href="https://github.com/Adez017" target="_blank" rel="noopener noreferrer" class="">GitHub</a></p>
<div></div>]]></content:encoded>
            <author>rathoreadityasingh30@gmail.com (Aditya Singh Rathore)</author>
            <author>sanjay@recodehive.com (Sanjay Viswanthan)</author>
            <category>sso</category>
            <category>single-sign-on</category>
            <category>authentication</category>
            <category>identity-provider</category>
            <category>oauth</category>
            <category>openid-connect</category>
            <category>saml</category>
            <category>security</category>
            <category>web</category>
        </item>
        <item>
            <title><![CDATA[Delta Lake: An Introduction to Trustworthy Data Storage]]></title>
            <link>https://www.recodehive.com/blog/deltalake-data-storage</link>
            <guid>https://www.recodehive.com/blog/deltalake-data-storage</guid>
            <pubDate>Fri, 01 May 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Delta Lake is an open-source storage framework that enables building a format agnostic Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, Hive, Snowflake, Google BigQuery, Athena, Redshift, Databricks, Azure Fabric and APIs for Scala, Java, Rust, and Python. With Delta Universal Format aka UniForm, you can read now Delta tables with Iceberg and Hudi clients.]]></description>
            <content:encoded><![CDATA[<p> </p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="there-is-something-wrong-with-your-data-lake">There Is Something Wrong With Your Data Lake<a href="https://www.recodehive.com/blog/deltalake-data-storage#there-is-something-wrong-with-your-data-lake" class="hash-link" aria-label="Direct link to There Is Something Wrong With Your Data Lake" title="Direct link to There Is Something Wrong With Your Data Lake" translate="no">​</a></h2>
<p>Imagine this: your firm receives hundreds of records per hour, be it users signing up for an account, making purchases, or using your mobile application. You store all these records in a data lake, which is hosted on the cloud. Got it?</p>
<p>Now, imagine something happening to this system. Two pipelines write to the same table simultaneously, overwriting each other. And now half of your data is gone. No one notices until it becomes obvious in the weekly report.</p>
<p>The issue described above is a common one when using traditional data lakes. The thing is that data lakes were created to solve a different problem, one of storing information rather than ensuring its reliability.
And that's what <strong>Delta Lake</strong> is designed to solve.</p>
<p><img decoding="async" loading="lazy" alt="delta-lake" src="https://www.recodehive.com/assets/images/delta-lakepng-875e621be9ca3864ec2d5a3aa2963413.png" width="1536" height="1024" class="img_ev3q"></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-is-delta-lake-in-plain-english">What is Delta Lake, in Plain English?<a href="https://www.recodehive.com/blog/deltalake-data-storage#what-is-delta-lake-in-plain-english" class="hash-link" aria-label="Direct link to What is Delta Lake, in Plain English?" title="Direct link to What is Delta Lake, in Plain English?" translate="no">​</a></h2>
<p>Consider a traditional data lake to be a folder in Google Drive, where anyone has the ability to edit or even delete anything inside without leaving an audit trail or version history.
What if that folder was:</p>
<ul>
<li class="">
<ol>
<li class="">Version-controlled and could be rolled back to any previous state</li>
</ol>
</li>
<li class="">
<ol start="2">
<li class="">Guaranteed to have a clean schema</li>
</ol>
</li>
<li class="">
<ol start="3">
<li class="">Structured such that bad data can't possibly get stored</li>
</ol>
</li>
<li class="">
<ol start="4">
<li class="">Secure against race conditions when used by multiple writers</li>
</ol>
</li>
</ul>
<p>This folder would be a Delta Lake. It operates over the storage already provided for your organization and makes all those promises without asking you to move off your storage infrastructure.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-four-unique-features-of-delta-lake">The Four Unique Features of Delta Lake<a href="https://www.recodehive.com/blog/deltalake-data-storage#the-four-unique-features-of-delta-lake" class="hash-link" aria-label="Direct link to The Four Unique Features of Delta Lake" title="Direct link to The Four Unique Features of Delta Lake" translate="no">​</a></h2>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="1-acid-transactions-corruption-free-data">1. ACID Transactions: Corruption-Free Data!<a href="https://www.recodehive.com/blog/deltalake-data-storage#1-acid-transactions-corruption-free-data" class="hash-link" aria-label="Direct link to 1. ACID Transactions: Corruption-Free Data!" title="Direct link to 1. ACID Transactions: Corruption-Free Data!" translate="no">​</a></h3>
<p>ACID Transactions are <code>Atomicity</code>, <code>Consistency</code>, <code>Isolation</code>, and <code>Durability</code>. It is not mandatory to memorize these terminologies, but it is essential to understand how they operate.
Delta Lake provides us a guarantee that when two processes attempt to modify the same dataset, none of them will overwrite the other's modification. Each process either proceeds or waits for their turn, which gives us consistency in our data like a queue at the cashier.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="2-time-travel-the-undo-feature">2. Time Travel: The "Undo" Feature<a href="https://www.recodehive.com/blog/deltalake-data-storage#2-time-travel-the-undo-feature" class="hash-link" aria-label="Direct link to 2. Time Travel: The &quot;Undo&quot; Feature" title="Direct link to 2. Time Travel: The &quot;Undo&quot; Feature" translate="no">​</a></h3>
<p>When working with a Delta table, all of your operations are kept in versioning. Accidentally deleted a record? Performed a bad update operation? With the time travel feature, we can revert changes and query the data at any point in time in history of our table.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="3-schema-enforcement-bad-data-rejection">3. Schema Enforcement: Bad Data Rejection<a href="https://www.recodehive.com/blog/deltalake-data-storage#3-schema-enforcement-bad-data-rejection" class="hash-link" aria-label="Direct link to 3. Schema Enforcement: Bad Data Rejection" title="Direct link to 3. Schema Enforcement: Bad Data Rejection" translate="no">​</a></h3>
<p>Suppose that your schema requires a certain field to only contain numerical values while another client attempts to send you a record that contains a string. In this case, Delta Lake blocks this row from being entered into the dataset.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="4-schema-evolution--evolving-without-breaking-anything">4. Schema Evolution – Evolving without Breaking Anything<a href="https://www.recodehive.com/blog/deltalake-data-storage#4-schema-evolution--evolving-without-breaking-anything" class="hash-link" aria-label="Direct link to 4. Schema Evolution – Evolving without Breaking Anything" title="Direct link to 4. Schema Evolution – Evolving without Breaking Anything" translate="no">​</a></h3>
<p>As your product matures, so does your data. Want to add an extra column? Delta Lake makes schema evolution easy – your data remains untouched while your workflows continue uninterrupted.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="and-how-exactly-does-that-work">And How Exactly Does That Work?<a href="https://www.recodehive.com/blog/deltalake-data-storage#and-how-exactly-does-that-work" class="hash-link" aria-label="Direct link to And How Exactly Does That Work?" title="Direct link to And How Exactly Does That Work?" translate="no">​</a></h2>
<p>All the magic above happens because of a mechanism known as the Transaction Log, and it’s kept in a folder named <code>_delta_log</code> within your table itself.
Every individual action, be it inserting, deleting, or updating records,  is logged in a JSON format within that log. Delta Lake relies on this transaction log to keep track of the latest status of your table, and which older files can be safely deleted from the system.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="heres-how-your-table-appears-on-the-disk">Here’s how your table appears on the disk:<a href="https://www.recodehive.com/blog/deltalake-data-storage#heres-how-your-table-appears-on-the-disk" class="hash-link" aria-label="Direct link to Here’s how your table appears on the disk:" title="Direct link to Here’s how your table appears on the disk:" translate="no">​</a></h2>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">my_table</span><span class="token operator" style="color:#393A34">/</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">├── _delta_log</span><span class="token operator" style="color:#393A34">/</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">│   ├── </span><span class="token number" style="color:#36acaa">00000000000000000000</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">json   ← </span><span class="token string" style="color:#e3116c">"Table was created"</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">│   ├── </span><span class="token number" style="color:#36acaa">00000000000000000001</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">json   ← </span><span class="token string" style="color:#e3116c">"10 rows were added"</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">│   └── </span><span class="token number" style="color:#36acaa">00000000000000000002</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">json   ← </span><span class="token string" style="color:#e3116c">"Salary column was updated"</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">├── part</span><span class="token operator" style="color:#393A34">-</span><span class="token number" style="color:#36acaa">00001</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">parquet</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">├── part</span><span class="token operator" style="color:#393A34">-</span><span class="token number" style="color:#36acaa">00002</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">parquet</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">└── part</span><span class="token operator" style="color:#393A34">-</span><span class="token number" style="color:#36acaa">00003</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">parquet</span><br></div></code></pre></div></div>
<p>The real data is stored in Parquet files, which are highly efficient in terms of querying. The transaction log is the brain, and the Parquet files are the data store..</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="lets-write-some-code">Let's Write Some Code<a href="https://www.recodehive.com/blog/deltalake-data-storage#lets-write-some-code" class="hash-link" aria-label="Direct link to Let's Write Some Code" title="Direct link to Let's Write Some Code" translate="no">​</a></h2>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="setting-up">Setting Up<a href="https://www.recodehive.com/blog/deltalake-data-storage#setting-up" class="hash-link" aria-label="Direct link to Setting Up" title="Direct link to Setting Up" translate="no">​</a></h3>
<div class="language-Python language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">pip install delta</span><span class="token operator" style="color:#393A34">-</span><span class="token plain">spark pyspark</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> pyspark</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">sql </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> SparkSession</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> delta </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> configure_spark_with_delta_pip</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">builder </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> SparkSession</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">builder \</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">appName</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"MyFirstDeltaTable"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> \</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">config</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"spark.sql.extensions"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"io.delta.sql.DeltaSparkSessionExtension"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> \</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">config</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"spark.sql.catalog.spark_catalog"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"org.apache.spark.sql.delta.catalog.DeltaCatalog"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">spark </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> configure_spark_with_delta_pip</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">builder</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">getOrCreate</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="creating-a-delta-table">Creating a Delta Table<a href="https://www.recodehive.com/blog/deltalake-data-storage#creating-a-delta-table" class="hash-link" aria-label="Direct link to Creating a Delta Table" title="Direct link to Creating a Delta Table" translate="no">​</a></h3>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token comment" style="color:#999988;font-style:italic"># Let's create a simple employee dataset</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">employees </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">(</span><span class="token number" style="color:#36acaa">1</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"Priya Sharma"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"Engineering"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">82000</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">(</span><span class="token number" style="color:#36acaa">2</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"Liam O'Brien"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"Marketing"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">67000</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">(</span><span class="token number" style="color:#36acaa">3</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"Yuki Tanaka"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"Engineering"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">91000</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">(</span><span class="token number" style="color:#36acaa">4</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"Carlos Mendez"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"Sales"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">74000</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">columns </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token string" style="color:#e3116c">"id"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"name"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"department"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"salary"</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">df </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> spark</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">createDataFrame</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">employees</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> columns</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># Save it as a Delta table</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">df</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">write</span><span class="token punctuation" style="color:#393A34">.</span><span class="token builtin">format</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"delta"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">mode</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"overwrite"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">save</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"/data/employees"</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
<p>That's it. You now have a Delta table with a transaction log, version history, and all the reliability features built in automatically.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="reading-it-back">Reading It Back<a href="https://www.recodehive.com/blog/deltalake-data-storage#reading-it-back" class="hash-link" aria-label="Direct link to Reading It Back" title="Direct link to Reading It Back" translate="no">​</a></h3>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">df </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> spark</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">read</span><span class="token punctuation" style="color:#393A34">.</span><span class="token builtin">format</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"delta"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">load</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"/data/employees"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">df</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">show</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">+---+-------------+------------+------+</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">| id|         name|  department|salary|</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">+---+-------------+------------+------+</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">|  1| Priya Sharma| Engineering| 82000|</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">|  2| Liam O'Brien|   Marketing| 67000|</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">|  3|  Yuki Tanaka| Engineering| 91000|</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">|  4|Carlos Mendez|       Sales| 74000|</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">+---+-------------+------------+------+</span><br></div></code></pre></div></div>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="using-time-travel">Using Time Travel<a href="https://www.recodehive.com/blog/deltalake-data-storage#using-time-travel" class="hash-link" aria-label="Direct link to Using Time Travel" title="Direct link to Using Time Travel" translate="no">​</a></h3>
<p>Let's say you update some salaries, then realize the update was wrong:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> delta</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">tables </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> DeltaTable</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">delta_table </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> DeltaTable</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">forPath</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">spark</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"/data/employees"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># Give everyone in Engineering a raise</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">delta_table</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">update</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    condition</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">"department = 'Engineering'"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token builtin">set</span><span class="token operator" style="color:#393A34">=</span><span class="token punctuation" style="color:#393A34">{</span><span class="token string" style="color:#e3116c">"salary"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"salary + 5000"</span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
<p>Oops!  turns out that update was wrong. No panic. Just travel back to version 0:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token comment" style="color:#999988;font-style:italic"># Check the history first</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">delta_table</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">history</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">show</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># Read the original data before the update</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">original_df </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> spark</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">read \</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">.</span><span class="token builtin">format</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"delta"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> \</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">option</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"versionAsOf"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">0</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> \</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">load</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"/data/employees"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">original_df</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">show</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
<p>You get your original data back, untouched. You can restore it, compare it, or just use it to figure out what went wrong.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="inserting-and-updating-at-the-same-time-merge">Inserting and Updating at the Same Time (MERGE)<a href="https://www.recodehive.com/blog/deltalake-data-storage#inserting-and-updating-at-the-same-time-merge" class="hash-link" aria-label="Direct link to Inserting and Updating at the Same Time (MERGE)" title="Direct link to Inserting and Updating at the Same Time (MERGE)" translate="no">​</a></h3>
<p>One of the most useful everyday operations is <code>MERGE</code>, often called an upsert.
It means: update the record if it exists, insert it if it doesn't.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token comment" style="color:#999988;font-style:italic"># Some incoming data -- one update, one brand new employee</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">incoming </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">(</span><span class="token number" style="color:#36acaa">2</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"Liam O'Brien"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"Marketing"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">71000</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain">  </span><span class="token comment" style="color:#999988;font-style:italic"># salary updated</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">(</span><span class="token number" style="color:#36acaa">5</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"Amara Osei"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"HR"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">69000</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain">            </span><span class="token comment" style="color:#999988;font-style:italic"># new employee</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">incoming_df </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> spark</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">createDataFrame</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">incoming</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> columns</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">delta_table</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">alias</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"existing"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">merge</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    incoming_df</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">alias</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"new"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">"existing.id = new.id"</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">whenMatchedUpdate</span><span class="token punctuation" style="color:#393A34">(</span><span class="token builtin">set</span><span class="token operator" style="color:#393A34">=</span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">"salary"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"new.salary"</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">whenNotMatchedInsert</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">values</span><span class="token operator" style="color:#393A34">=</span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">"id"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain">         </span><span class="token string" style="color:#e3116c">"new.id"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">"name"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain">       </span><span class="token string" style="color:#e3116c">"new.name"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">"department"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"new.department"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token string" style="color:#e3116c">"salary"</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain">     </span><span class="token string" style="color:#e3116c">"new.salary"</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">execute</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
<p>One operation. No duplicates. No manual checking. Clean results every time.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="keeping-your-table-healthy">Keeping Your Table Healthy<a href="https://www.recodehive.com/blog/deltalake-data-storage#keeping-your-table-healthy" class="hash-link" aria-label="Direct link to Keeping Your Table Healthy" title="Direct link to Keeping Your Table Healthy" translate="no">​</a></h3>
<p>Over time, Delta Lake accumulates old data files for time travel. You'll want to periodically clean those up:</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token comment" style="color:#999988;font-style:italic"># Remove files older than 7 days</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">spark</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">sql</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"VACUUM delta.`/data/employees` RETAIN 168 HOURS"</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">And </span><span class="token keyword" style="color:#00009f">if</span><span class="token plain"> your table gets many small files over time </span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">which slows down queries</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> compact them</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">python</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># Compact small files into larger, more efficient ones</span><span class="token plain"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">spark</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">sql</span><span class="token punctuation" style="color:#393A34">(</span><span class="token string" style="color:#e3116c">"OPTIMIZE delta.`/data/employees`"</span><span class="token punctuation" style="color:#393A34">)</span><br></div></code></pre></div></div>
<p>Think of <code>VACUUM</code> as taking out the trash and <code>OPTIMIZE</code> as reorganizing your desk. Both are good habits to run on a schedule.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="when-should-you-utilize-delta-lake">When Should You Utilize Delta Lake?<a href="https://www.recodehive.com/blog/deltalake-data-storage#when-should-you-utilize-delta-lake" class="hash-link" aria-label="Direct link to When Should You Utilize Delta Lake?" title="Direct link to When Should You Utilize Delta Lake?" translate="no">​</a></h2>
<p>Delta Lake is perfect for use when:</p>
<ul>
<li class="">
<ol>
<li class="">There are several pipelines or multiple parties writing to the same data set.</li>
</ol>
</li>
<li class="">
<ol start="2">
<li class="">An audit history of all changes is necessary.</li>
</ol>
</li>
<li class="">
<ol start="3">
<li class="">The schema of your data can change.</li>
</ol>
</li>
<li class="">
<ol start="4">
<li class="">You would like to detect any data that could cause problems.</li>
</ol>
</li>
<li class="">
<ol start="5">
<li class="">Real-time streams and batch historical data are being combined.</li>
</ol>
</li>
</ul>
<p>If you have static files that are never going to be changed, then regular Parquet will be sufficient. However, the second your data becomes dynamic, it's worth its weight in gold.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="conclusion">Conclusion<a href="https://www.recodehive.com/blog/deltalake-data-storage#conclusion" class="hash-link" aria-label="Direct link to Conclusion" title="Direct link to Conclusion" translate="no">​</a></h2>
<p>In essence, Delta Lake starts with taking the idea of a data lake – low-cost, scalable, and flexible storage – and makes it reliable. The ACID transaction model eliminates silent corruptions, time travel allows you to get back your data on any mistake, while schema enforcement prevents bad data from entering your system, while at the same time schema evolution makes sure your data stack evolves easily.</p>
<p>And at the heart of this system lies nothing else but a transaction log – an easy and audit-ready record of every transaction made to your data.</p>
<p>When it comes to building data pipelines where data quality really matters – which happens sooner or later – Delta Lake cannot be anything else but the base of your stack. But most importantly, it’s very easy to implement.</p>
<div></div>]]></content:encoded>
            <author>rathoreadityasingh30@gmail.com (Aditya Singh Rathore)</author>
            <category>deltalake</category>
            <category>storage</category>
            <category>Big Data</category>
            <category>cloud</category>
            <category>Data Engineering</category>
            <category>fabric</category>
        </item>
        <item>
            <title><![CDATA[How I cleared DP-700 Certification Exam]]></title>
            <link>https://www.recodehive.com/blog/fabric-data-engineer</link>
            <guid>https://www.recodehive.com/blog/fabric-data-engineer</guid>
            <pubDate>Fri, 01 May 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[A comprehensive guide to clearing the Microsoft Fabric Data Engineer Associate (DP-700) certification. Learn the preparation strategy, key concepts, hands-on practice tips, and real exam experience from someone who passed it. Discover why Lakehouse, Delta Tables, Dataflows, and DirectLake mode matter, and how to approach scenario-based questions effectively.]]></description>
            <content:encoded><![CDATA[<p> </p>
<p>If you're a data engineer working in the Microsoft ecosystem, Microsoft Fabric is impossible to ignore , and the DP-700 certification is one of the best ways to prove you understand it. I recently cleared the <strong>Microsoft DP-700: Fabric Data Engineer Associate</strong> exam, and this is an honest breakdown of how I did it, what actually helped, and what you should skip.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-is-microsoft-fabric-really">What Is Microsoft Fabric, Really?<a href="https://www.recodehive.com/blog/fabric-data-engineer#what-is-microsoft-fabric-really" class="hash-link" aria-label="Direct link to What Is Microsoft Fabric, Really?" title="Direct link to What Is Microsoft Fabric, Really?" translate="no">​</a></h2>
<p>Before diving into the prep strategy, let's quickly address what makes Fabric different.</p>
<p>Microsoft Fabric is not just another Azure tool. It's Microsoft's attempt to merge your <strong>entire modern data stack into a single platform</strong> — data engineering, data science, data warehousing, real-time analytics, and Power BI, all under one roof.</p>
<p>Think of it this way: earlier, you had Azure Data Factory for orchestration, Synapse for warehousing, and Power BI for reporting — three separate tools with separate setups and billing. Fabric brings all of that together in one unified experience.</p>
<p>This shift in architecture is exactly why the DP-700 exam feels different from other Azure certifications. It's not about memorizing service names — it's about understanding <em>how these pieces fit together</em> in real-world data solutions.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="about-the-dp-700-exam">About the DP-700 Exam<a href="https://www.recodehive.com/blog/fabric-data-engineer#about-the-dp-700-exam" class="hash-link" aria-label="Direct link to About the DP-700 Exam" title="Direct link to About the DP-700 Exam" translate="no">​</a></h2>
<table><thead><tr><th>Detail</th><th>Info</th></tr></thead><tbody><tr><td><strong>Full Name</strong></td><td>Microsoft Fabric Data Engineer Associate</td></tr><tr><td><strong>Level</strong></td><td>Associate</td></tr><tr><td><strong>Format</strong></td><td>MCQs + Case Studies</td></tr><tr><td><strong>Difficulty</strong></td><td>Medium (concept-heavy, not definition-heavy)</td></tr><tr><td><strong>Focus</strong></td><td>Real-world architecture and decision-making</td></tr></tbody></table>
<p>One important reality check: <strong>this is not a memorization exam.</strong> If you go in trying to rote-learn definitions, the scenario-based questions will catch you off guard. The exam tests whether you can make the right architectural decision — not whether you can recite what a Lakehouse is.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="my-preparation-strategy">My Preparation Strategy<a href="https://www.recodehive.com/blog/fabric-data-engineer#my-preparation-strategy" class="hash-link" aria-label="Direct link to My Preparation Strategy" title="Direct link to My Preparation Strategy" translate="no">​</a></h2>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="1-microsoft-learn--your-non-negotiable-starting-point">1. Microsoft Learn — Your Non-Negotiable Starting Point<a href="https://www.recodehive.com/blog/fabric-data-engineer#1-microsoft-learn--your-non-negotiable-starting-point" class="hash-link" aria-label="Direct link to 1. Microsoft Learn — Your Non-Negotiable Starting Point" title="Direct link to 1. Microsoft Learn — Your Non-Negotiable Starting Point" translate="no">​</a></h3>
<p>Start here, period. The Microsoft Learn paths for DP-700 are well-structured and align closely with the actual exam topics. They cover all the core concepts across Fabric's components.</p>
<p>That said, Microsoft Learn alone is not enough. Think of it as building your foundation — you still need to put that foundation to work.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="2-hands-on-practice--the-actual-game-changer">2. Hands-On Practice — The Actual Game Changer<a href="https://www.recodehive.com/blog/fabric-data-engineer#2-hands-on-practice--the-actual-game-changer" class="hash-link" aria-label="Direct link to 2. Hands-On Practice — The Actual Game Changer" title="Direct link to 2. Hands-On Practice — The Actual Game Changer" translate="no">​</a></h3>
<p>This is where most candidates underinvest, and it shows on exam day.</p>
<p>I spent dedicated time:</p>
<ul>
<li class="">Creating and exploring <strong>Lakehouses</strong></li>
<li class="">Building and running <strong>Data Pipelines</strong></li>
<li class="">Working with <strong>Dataflows Gen2</strong></li>
<li class="">Exploring the <strong>Fabric UI</strong> thoroughly (this matters more than you think)</li>
</ul>
<p>Microsoft Fabric has a free trial. Use it. The exam includes scenario questions where you need to navigate or reason about the interface. If you've never seen it, you'll struggle to answer those questions confidently.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="3-practice-tests--learn-to-eliminate-not-just-recall">3. Practice Tests — Learn to Eliminate, Not Just Recall<a href="https://www.recodehive.com/blog/fabric-data-engineer#3-practice-tests--learn-to-eliminate-not-just-recall" class="hash-link" aria-label="Direct link to 3. Practice Tests — Learn to Eliminate, Not Just Recall" title="Direct link to 3. Practice Tests — Learn to Eliminate, Not Just Recall" translate="no">​</a></h3>
<p>Practice tests serve two purposes. First, they show you where your weak areas are. Second, and more importantly, they teach you how to approach tricky answer options.</p>
<p>Many DP-700 questions have two options that look almost identical. The skill you're actually being tested on is <strong>eliminating the wrong answer</strong> ,not picking the right one from memory. Practice tests train that skill.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="4-youtube-for-concept-clarity">4. YouTube for Concept Clarity<a href="https://www.recodehive.com/blog/fabric-data-engineer#4-youtube-for-concept-clarity" class="hash-link" aria-label="Direct link to 4. YouTube for Concept Clarity" title="Direct link to 4. YouTube for Concept Clarity" translate="no">​</a></h3>
<p>Whenever a concept didn't fully click after reading, I turned to YouTube. Sometimes a 10-minute video does what 2 hours of documentation can't. Particularly useful for visual concepts like DirectLake mode, Delta Table versioning, and pipeline orchestration flows.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="key-concepts-you-must-know">Key Concepts You Must Know<a href="https://www.recodehive.com/blog/fabric-data-engineer#key-concepts-you-must-know" class="hash-link" aria-label="Direct link to Key Concepts You Must Know" title="Direct link to Key Concepts You Must Know" translate="no">​</a></h2>
<p>These are the areas that carry the most weight in the exam. If any of these feel unclear, go back and invest time here before moving forward.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="lakehouse">Lakehouse<a href="https://www.recodehive.com/blog/fabric-data-engineer#lakehouse" class="hash-link" aria-label="Direct link to Lakehouse" title="Direct link to Lakehouse" translate="no">​</a></h3>
<p>The Lakehouse is the central concept in Microsoft Fabric. It combines the flexibility of a Data Lake with the structure of a Data Warehouse. If this concept isn't solid, everything built on top of it will feel unstable.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="data-pipelines-vs-dataflows-gen2">Data Pipelines vs. Dataflows Gen2<a href="https://www.recodehive.com/blog/fabric-data-engineer#data-pipelines-vs-dataflows-gen2" class="hash-link" aria-label="Direct link to Data Pipelines vs. Dataflows Gen2" title="Direct link to Data Pipelines vs. Dataflows Gen2" translate="no">​</a></h3>
<p>A common trap in the exam is knowing <em>when</em> to use each:</p>
<ul>
<li class=""><strong>Pipelines</strong> → Orchestration (similar to Azure Data Factory). Use for scheduling, triggering, and controlling the flow of data.</li>
<li class=""><strong>Dataflows Gen2</strong> → Transformation. Use for cleaning, shaping, and preparing data using a Power Query-like interface.</li>
</ul>
<p>The exam loves to test this distinction with scenario questions.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="delta-tables">Delta Tables<a href="https://www.recodehive.com/blog/fabric-data-engineer#delta-tables" class="hash-link" aria-label="Direct link to Delta Tables" title="Direct link to Delta Tables" translate="no">​</a></h3>
<p>Delta Tables are the backbone of storage in Fabric. Key areas to understand:</p>
<ul>
<li class="">ACID transaction support</li>
<li class="">Time travel and versioning</li>
<li class="">How Delta integrates with the Lakehouse</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="power-bi-and-directlake-mode">Power BI and DirectLake Mode<a href="https://www.recodehive.com/blog/fabric-data-engineer#power-bi-and-directlake-mode" class="hash-link" aria-label="Direct link to Power BI and DirectLake Mode" title="Direct link to Power BI and DirectLake Mode" translate="no">​</a></h3>
<p>DirectLake is one of Fabric's most important innovations — it allows Power BI to query data directly from the Lakehouse without importing it, while still delivering near-import performance. This appears in multiple exam scenarios.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="workspace-and-security-model">Workspace and Security Model<a href="https://www.recodehive.com/blog/fabric-data-engineer#workspace-and-security-model" class="hash-link" aria-label="Direct link to Workspace and Security Model" title="Direct link to Workspace and Security Model" translate="no">​</a></h3>
<p>Understand roles, permissions, and how access is managed across Fabric items. Security-related questions appear more than people expect.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="my-study-timeline">My Study Timeline<a href="https://www.recodehive.com/blog/fabric-data-engineer#my-study-timeline" class="hash-link" aria-label="Direct link to My Study Timeline" title="Direct link to My Study Timeline" translate="no">​</a></h2>
<p>This is what actually happened — not an ideal plan, but an honest one:</p>
<ul>
<li class=""><strong>Week 1</strong> — Went through Microsoft Learn modules and explored the Fabric UI (a lot of clicking around to understand the platform)</li>
<li class=""><strong>Week 2</strong> — Hands-on practice: built pipelines, created Lakehouses, ran Dataflows, explored Delta Tables</li>
<li class=""><strong>Week 3</strong> — Practice tests, identified weak areas, revised those topics, and did a final pass on key concepts</li>
</ul>
<p>Some days I studied 3–4 focused hours. Some days were slower. Consistency over intensity is what got me through.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="exam-day--what-it-actually-felt-like">Exam Day — What It Actually Felt Like<a href="https://www.recodehive.com/blog/fabric-data-engineer#exam-day--what-it-actually-felt-like" class="hash-link" aria-label="Direct link to Exam Day — What It Actually Felt Like" title="Direct link to Exam Day — What It Actually Felt Like" translate="no">​</a></h2>
<p>Here's a realistic walkthrough of the experience:</p>
<ul>
<li class=""><strong>First few questions</strong>: Straightforward — concepts you've covered</li>
<li class=""><strong>Middle section</strong>: Scenario-based questions where two options look very similar. This is where hands-on familiarity pays off.</li>
<li class=""><strong>Case studies</strong>: Time-consuming but manageable if you understand architecture well</li>
<li class=""><strong>End section</strong>: A few questions that feel unexpected — stay calm, apply what you know</li>
</ul>
<p>Key observations from exam day:</p>
<ul>
<li class="">Time management matters. Don't spend 10 minutes on one question.</li>
<li class="">Read each question fully before looking at options.</li>
<li class="">Scenario questions reward understanding, not recall.</li>
</ul>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-to-do-and-what-to-avoid">What to Do (and What to Avoid)<a href="https://www.recodehive.com/blog/fabric-data-engineer#what-to-do-and-what-to-avoid" class="hash-link" aria-label="Direct link to What to Do (and What to Avoid)" title="Direct link to What to Do (and What to Avoid)" translate="no">​</a></h2>
<p><strong>Do this:</strong></p>
<ul>
<li class="">Practice hands-on inside Fabric (free trial is available)</li>
<li class="">Understand the <em>why</em> behind architectural choices, not just what each component does</li>
<li class="">Learn from practice test mistakes — review every wrong answer</li>
<li class="">Revise your weak areas before the exam, not your strong areas</li>
</ul>
<p><strong>Avoid this:</strong></p>
<ul>
<li class="">Trying to memorize definitions — the exam will test application, not recall</li>
<li class="">Skipping the UI experience — you need to recognize Fabric's interface</li>
<li class="">Ignoring practice tests — they're the closest thing to the real exam experience</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="is-dp-700-worth-it">Is DP-700 Worth It?<a href="https://www.recodehive.com/blog/fabric-data-engineer#is-dp-700-worth-it" class="hash-link" aria-label="Direct link to Is DP-700 Worth It?" title="Direct link to Is DP-700 Worth It?" translate="no">​</a></h2>
<p><strong>Yes, if:</strong></p>
<ul>
<li class="">You're a data engineer or data professional working with Microsoft technologies</li>
<li class="">You're building or designing modern data platforms</li>
<li class="">You want to position yourself for roles that involve Microsoft Fabric, Synapse, or Power BI</li>
</ul>
<p><strong>Not essential if:</strong></p>
<ul>
<li class="">You have no plans to work in the Microsoft data ecosystem</li>
<li class="">You're focused on non-data engineering roles</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="final-thoughts">Final Thoughts<a href="https://www.recodehive.com/blog/fabric-data-engineer#final-thoughts" class="hash-link" aria-label="Direct link to Final Thoughts" title="Direct link to Final Thoughts" translate="no">​</a></h2>
<p>Microsoft Fabric is still maturing, but its direction is clear — Microsoft is consolidating the modern data stack into a single platform, and it's gaining adoption fast. Understanding Fabric deeply, not just passing an exam on it, is genuinely useful right now.</p>
<p>The DP-700 is a solid way to validate that understanding. Approach it with real hands-on practice and a focus on concepts over definitions, and you'll be in a good position on exam day.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="useful-resources">Useful Resources<a href="https://www.recodehive.com/blog/fabric-data-engineer#useful-resources" class="hash-link" aria-label="Direct link to Useful Resources" title="Direct link to Useful Resources" translate="no">​</a></h2>
<ul>
<li class=""><a href="https://learn.microsoft.com/en-us/credentials/certifications/fabric-analytics-engineer-associate/" target="_blank" rel="noopener noreferrer" class="">Microsoft Learn — DP-700 Study Guide</a></li>
<li class=""><a href="https://app.fabric.microsoft.com/" target="_blank" rel="noopener noreferrer" class="">Microsoft Fabric Free Trial</a></li>
<li class=""><a href="https://www.recodehive.com/docs/" target="_blank" rel="noopener noreferrer" class="">RecodeHive — Data Engineering Tutorials</a></li>
</ul>
<hr>
<p><em>Have questions about DP-600 prep or Microsoft Fabric? Drop a comment below — happy to help.</em></p>
<p><em>Connect on <a href="https://www.linkedin.com/in/aditya-singh-rathore0017/" target="_blank" rel="noopener noreferrer" class="">LinkedIn</a> | <a href="https://github.com/Adez017" target="_blank" rel="noopener noreferrer" class="">GitHub</a></em></p>
<div></div>]]></content:encoded>
            <author>rathoreadityasingh30@gmail.com (Aditya Singh Rathore)</author>
            <category>DP-700</category>
            <category>azure</category>
            <category>Big Data</category>
            <category>cloud</category>
            <category>certification</category>
            <category>Data Engineering</category>
            <category>fabric</category>
            <category>experience</category>
        </item>
        <item>
            <title><![CDATA[Lakehouse vs Data Warehouse: What's the Difference and When to Use Each]]></title>
            <link>https://www.recodehive.com/blog/lakehouse-vs-warehouse</link>
            <guid>https://www.recodehive.com/blog/lakehouse-vs-warehouse</guid>
            <pubDate>Fri, 01 May 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Lakehouse and Data Warehouse are two of the most debated architectures in modern data engineering. This article breaks down how they differ, where each fits in the data lifecycle, and how to choose between them, without the platform bias.]]></description>
            <content:encoded><![CDATA[<p>I made a mistake in my second month as a data engineer.</p>
<p>Our startup was growing fast, three data sources had become twelve almost overnight. Product events from Mixpanel, orders from Shopify, support tickets from Zendesk, raw logs from our backend. I needed everything in one place, queryable, fast.</p>
<p>So I did what made sense at the time: I dumped everything into our Snowflake warehouse. Raw JSON blobs, unnested arrays, half-cleaned API responses — all of it, straight in.</p>
<p>Three weeks later, our BI team couldn't trust a single number. Our schema was a mess. Re-ingesting data cost us real money. And every new data source I added made things worse, not better.</p>
<p>That mess is what taught me the real difference between a <strong>Lakehouse</strong> and a <strong>Data Warehouse</strong> and more importantly, why you almost always need both.</p>
<p><img decoding="async" loading="lazy" alt="Lakehouse Vs Warehouse" src="https://www.recodehive.com/assets/images/lake_vs_ware-dd4d2995303914c36b714f9340288089.png" width="1672" height="941" class="img_ev3q"></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-is-a-data-warehouse">What Is a Data Warehouse?<a href="https://www.recodehive.com/blog/lakehouse-vs-warehouse#what-is-a-data-warehouse" class="hash-link" aria-label="Direct link to What Is a Data Warehouse?" title="Direct link to What Is a Data Warehouse?" translate="no">​</a></h2>
<p>After my Snowflake disaster, a senior engineer on the team pulled me aside and said something I didn't fully appreciate at the time:</p>
<blockquote>
<p><em>"A warehouse is not a dumping ground. It's a showroom."</em></p>
</blockquote>
<p>He was right. The Data Warehouse has been the backbone of business intelligence for decades precisely because it enforces discipline. Data must be cleaned and structured <strong>before</strong> it enters. No exceptions.</p>
<p>This is called <strong>schema-on-write</strong>, the shape of your data is defined upfront, and anything that doesn't fit gets rejected. That strictness feels like a constraint until you're the analyst trying to build a board-level revenue report and you actually need to trust the numbers.</p>
<p><strong>Key characteristics:</strong></p>
<ul>
<li class="">
<ol>
<li class="">Designed for structured, cleaned, analytics-ready data</li>
</ol>
</li>
<li class="">
<ol start="2">
<li class="">Strict schema enforcement (schema-on-write)</li>
</ol>
</li>
<li class="">
<ol start="3">
<li class="">Highly optimized for SQL-based analytical queries</li>
</ol>
</li>
<li class="">
<ol start="4">
<li class="">Strong governance, security, and access controls</li>
</ol>
</li>
<li class="">
<ol start="5">
<li class="">Primary consumers are SQL analysts, BI teams, and business stakeholders</li>
</ol>
</li>
</ul>
<p>Platforms like <strong>Snowflake</strong>, <strong>Google BigQuery</strong>, <strong>Amazon Redshift</strong>, and <strong>Azure Synapse</strong> are well-known implementations. They excel when your data is already clean and your consumers need fast, reliable SQL access.</p>
<p>My mistake wasn't using Snowflake. It was using it for the wrong stage of the pipeline.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-is-a-lakehouse">What Is a Lakehouse?<a href="https://www.recodehive.com/blog/lakehouse-vs-warehouse#what-is-a-lakehouse" class="hash-link" aria-label="Direct link to What Is a Lakehouse?" title="Direct link to What Is a Lakehouse?" translate="no">​</a></h2>
<p>After the Snowflake incident, I started reading about data lakes. The pitch was appealing: store everything cheaply in raw form, figure out structure later.</p>
<p>So I tried that next. We set up an Azure Data Lake, dumped our raw files in -  CSVs, JSONs, Parquet, logs and called it a win.</p>
<p>Except six months later, nobody could find anything. Data existed, but nobody trusted it. There was no validation, no versioning, no way to know if what you were querying was the right version of a file. We had built what the industry lovingly calls a <strong>data swamp</strong>.</p>
<p>The Lakehouse pattern emerged to solve exactly this problem. It takes the cost efficiency and flexibility of object storage, and adds a proper table layer on top using open formats like <strong>Delta Lake</strong>, <strong>Apache Iceberg</strong>, or <strong>Apache Hudi</strong>. You get ACID transactions, schema enforcement, time travel, and SQL access without abandoning the flexibility of raw storage.</p>
<p><strong>Key characteristics:</strong></p>
<ul>
<li class="">
<ol>
<li class="">Stores raw, semi-structured, and structured data in a single system</li>
</ol>
</li>
<li class="">
<ol start="2">
<li class="">Uses open table formats (Delta Lake, Iceberg, Hudi)</li>
</ol>
</li>
<li class="">
<ol start="3">
<li class="">Supports multiple processing engines like Spark, Python, and SQL</li>
</ol>
</li>
<li class="">
<ol start="4">
<li class="">Schema can evolve over time as data needs change</li>
</ol>
</li>
<li class="">
<ol start="5">
<li class="">Supports both engineering pipelines and ML workflows from the same storage layer</li>
</ol>
</li>
</ul>
<p>Platforms like <strong>Databricks</strong> and modern cloud-native setups implement this pattern well. It's particularly powerful when your team spans both data engineering and data science — both can work from the same storage layer without stepping on each other.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="key-differences-at-a-glance">Key Differences at a Glance<a href="https://www.recodehive.com/blog/lakehouse-vs-warehouse#key-differences-at-a-glance" class="hash-link" aria-label="Direct link to Key Differences at a Glance" title="Direct link to Key Differences at a Glance" translate="no">​</a></h2>
<table><thead><tr><th>Aspect</th><th>Lakehouse</th><th>Data Warehouse</th></tr></thead><tbody><tr><td><strong>Data Type</strong></td><td>Raw, semi-structured, and structured</td><td>Structured only</td></tr><tr><td><strong>Schema Approach</strong></td><td>Schema-on-read or evolving</td><td>Schema-on-write, strict</td></tr><tr><td><strong>Flexibility</strong></td><td>High</td><td>Moderate</td></tr><tr><td><strong>Processing Engines</strong></td><td>Spark, Python, SQL</td><td>Primarily SQL</td></tr><tr><td><strong>Primary Users</strong></td><td>Data Engineers, Data Scientists</td><td>Analysts, BI teams</td></tr><tr><td><strong>Primary Use Cases</strong></td><td>Ingestion, transformation, ML</td><td>Reporting, dashboards, ad-hoc analytics</td></tr><tr><td><strong>Governance Maturity</strong></td><td>Developing</td><td>Mature, well-established</td></tr><tr><td><strong>Storage Cost</strong></td><td>Lower (object storage)</td><td>Higher (optimized proprietary storage)</td></tr></tbody></table>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="when-to-use-a-lakehouse">When to Use a Lakehouse<a href="https://www.recodehive.com/blog/lakehouse-vs-warehouse#when-to-use-a-lakehouse" class="hash-link" aria-label="Direct link to When to Use a Lakehouse" title="Direct link to When to Use a Lakehouse" translate="no">​</a></h2>
<p>Think of the Lakehouse as the <strong>engineering zone</strong>.</p>
<p>In our case, this is where raw Shopify orders land at 2am, where Mixpanel event logs pile up, where our ML team runs experiments on customer behavior data. It's messy in the best possible way flexible, cheap, and tolerant of the chaos that comes with early-stage data.</p>
<p>Use a Lakehouse when:</p>
<ul>
<li class="">You are ingesting raw or semi-structured data from APIs, event streams, IoT devices, or application logs</li>
<li class="">You need to run transformation and cleaning pipelines before data is analytics-ready</li>
<li class="">Your team works primarily in Spark or Python</li>
<li class="">Your schema changes frequently as business or source systems evolve</li>
<li class="">You are building ML features, training datasets, or experimental models</li>
<li class="">You need cost-efficient storage for large volumes of data at various stages of processing</li>
</ul>
<p>If I had started here instead of going straight to Snowflake, I would have saved myself three weeks of firefighting.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="when-to-use-a-data-warehouse">When to Use a Data Warehouse<a href="https://www.recodehive.com/blog/lakehouse-vs-warehouse#when-to-use-a-data-warehouse" class="hash-link" aria-label="Direct link to When to Use a Data Warehouse" title="Direct link to When to Use a Data Warehouse" translate="no">​</a></h2>
<p>Think of the Data Warehouse as the <strong>consumption zone</strong>.</p>
<p>Once our data was cleaned and validated in the Lakehouse, we loaded curated datasets into Snowflake and <em>that</em> is when it finally worked the way it was supposed to. Our BI team connected Power BI to it, the finance team ran their monthly reports, and the numbers matched.</p>
<p>Use a Data Warehouse when:</p>
<ul>
<li class="">Data has already been transformed and is ready for consumption</li>
<li class="">Your consumers are SQL analysts or BI teams using tools like Tableau, Looker, or Power BI</li>
<li class="">You need fast, predictable query performance on large structured datasets</li>
<li class="">Governance, row-level security, and access controls are critical requirements</li>
<li class="">You are supporting stable, recurring reports that business decisions depend on</li>
</ul>
<p>The warehouse isn't where data is processed. It's where processed data is <em>served</em>.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="how-they-work-together">How They Work Together<a href="https://www.recodehive.com/blog/lakehouse-vs-warehouse#how-they-work-together" class="hash-link" aria-label="Direct link to How They Work Together" title="Direct link to How They Work Together" translate="no">​</a></h2>
<p>Here's what nobody tells you early enough: <strong>you almost always need both</strong>.</p>
<p>Lakehouse and Data Warehouse are not competing choices. They serve different stages of the same data lifecycle. Once we restructured our setup, the flow looked like this:</p>
<ol>
<li class="">Raw data lands in the Lakehouse : Shopify orders, Mixpanel events, Zendesk tickets, all of it</li>
<li class="">Our data engineers transform and clean it using Spark and dbt</li>
<li class="">Curated, structured datasets are loaded into Snowflake</li>
<li class="">Power BI and Tableau connect to Snowflake for dashboards and business reporting</li>
</ol>
<p>The Lakehouse handled the complexity of early-stage data. The Warehouse handled the reliability of what our stakeholders actually saw. Each did what it was best at.</p>
<p>The moment we stopped treating them as alternatives and started treating them as sequential layers, everything clicked.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="choosing-between-them">Choosing Between Them<a href="https://www.recodehive.com/blog/lakehouse-vs-warehouse#choosing-between-them" class="hash-link" aria-label="Direct link to Choosing Between Them" title="Direct link to Choosing Between Them" translate="no">​</a></h2>
<p>If you're still unsure, here's the simplest filter I've found: <strong>ask who is consuming this data, and in what state.</strong></p>
<ul>
<li class="">If the consumer is a data engineer or data scientist working with raw or intermediate data → <strong>Lakehouse</strong></li>
<li class="">If the consumer is an analyst or business user needing clean, structured data for reporting → <strong>Data Warehouse</strong></li>
<li class="">If you have both types of consumers (and most teams do after a few months of growth) → <strong>use both, in sequence</strong></li>
</ul>
<p>The workload determines the architecture. Not preference, not trend, not what a vendor happens to be marketing this quarter.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="conclusion">Conclusion<a href="https://www.recodehive.com/blog/lakehouse-vs-warehouse#conclusion" class="hash-link" aria-label="Direct link to Conclusion" title="Direct link to Conclusion" translate="no">​</a></h2>
<p>I wasted a month learning this the hard way. You don't have to.</p>
<p>The Lakehouse gives you flexibility, scale, and support for diverse workloads across engineering and data science. The Data Warehouse gives you structure, query performance, and the governance that business reporting demands.</p>
<p>They're not rivals. They're teammates. And the best data platforms I've seen since don't choose between them — they use each exactly where it belongs, and build the pipeline that connects them.</p>
<p>If you're in the early stages of designing your data platform and figuring out where each piece fits, I'd love to compare notes.</p>
<p>🔗 <a href="https://www.linkedin.com/in/aditya-singh-rathore0017/" target="_blank" rel="noopener noreferrer" class="">LinkedIn</a> | <a href="https://github.com/Adez017" target="_blank" rel="noopener noreferrer" class="">GitHub</a></p>
<div></div>]]></content:encoded>
            <author>rathoreadityasingh30@gmail.com (Aditya Singh Rathore)</author>
            <category>lakehouse</category>
            <category>data-warehouse</category>
            <category>data-engineering</category>
            <category>big-data</category>
            <category>delta-lake</category>
            <category>spark</category>
            <category>analytics</category>
            <category>snowflake</category>
            <category>databricks</category>
        </item>
        <item>
            <title><![CDATA[Microsoft Fabric: One Platform, One Lake, Every Data Workload]]></title>
            <link>https://www.recodehive.com/blog/microsoft-fabric-explained</link>
            <guid>https://www.recodehive.com/blog/microsoft-fabric-explained</guid>
            <pubDate>Fri, 01 May 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[Microsoft Fabric is a unified analytics platform that brings together data engineering, data science, real-time analytics, and business intelligence under a single roof — all built on OneLake. Learn how Fabric is architected, how data flows through it, and why it matters for modern data teams.]]></description>
            <content:encoded><![CDATA[<p>Modern data teams don't struggle because of a lack of tools - they struggle because of too many.</p>
<p>A typical data stack today might include a cloud data warehouse, an object store, a managed Spark environment, a pipeline orchestration tool, and a BI layer on top. Each powerful on its own. But getting them to work together, moving data across systems, keeping governance consistent, debugging failures across layers often becomes a bigger challenge than the actual data work itself.</p>
<p>I ran into this exact problem while building pipelines across Azure Data Factory, ADLS Gen2, and Synapse. Every hand-off between tools meant another connection to configure, another permission to grant, another place for something to silently break.</p>
<p>Microsoft Fabric takes a different approach, instead of adding another tool to the stack, it brings everything together into a single unified platform. Here's how it actually works.</p>
<p><img decoding="async" loading="lazy" alt="Fabric platform" src="https://www.recodehive.com/assets/images/fabric-unified-0e47ea8a86ce8b7176855a3efa7a91c3.png" width="1536" height="864" class="img_ev3q"></p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-foundation-onelake">The Foundation: OneLake<a href="https://www.recodehive.com/blog/microsoft-fabric-explained#the-foundation-onelake" class="hash-link" aria-label="Direct link to The Foundation: OneLake" title="Direct link to The Foundation: OneLake" translate="no">​</a></h2>
<p>Every component in Fabric is built on top of <strong>OneLake</strong>, the platform's unified, logical data lake and the single source of truth for your entire Fabric workspace.</p>
<p>Every workload, whether it's a Spark notebook, a SQL warehouse query, a Power BI report, or an ML experiment, reads from and writes to the same underlying storage. No data movement between services. No export-and-reload step when a data scientist needs access to a table a data engineer just built.</p>
<p>OneLake stores everything in <strong>Delta Parquet format</strong>, an open-source table format that supports ACID transactions, schema enforcement, time travel, and versioning. This matters: your data is not locked into a proprietary format. It's readable by Spark, DuckDB, Pandas, Polars, and most modern query engines outside of Fabric too.</p>
<blockquote>
<p>📖 Read more: <a href="https://learn.microsoft.com/en-us/fabric/onelake/onelake-overview" target="_blank" rel="noopener noreferrer" class="">What is OneLake?</a></p>
</blockquote>
<p>The first time I opened OneLake in my Fabric workspace, what struck me was how everything just <em>appeared</em>, my Lakehouse tables, my warehouse tables, all visible in one file explorer without any registration or sync step. That's when the "one lake" concept clicked for me practically, not just conceptually.</p>
<p><img decoding="async" loading="lazy" alt="OneLake file explorer showing Lakehouse and Warehouse tables in one view" src="https://www.recodehive.com/assets/images/onelake-explorer-5dfc0845fd3bf18548309abc13be0a20.png" width="1817" height="812" class="img_ev3q">
<em>📸 Screenshot: OneLake file explorer from my Fabric workspace — Lakehouse and Warehouse tables visible side by side</em></p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="data-engineering-lakehouses-spark-and-notebooks">Data Engineering: Lakehouses, Spark, and Notebooks<a href="https://www.recodehive.com/blog/microsoft-fabric-explained#data-engineering-lakehouses-spark-and-notebooks" class="hash-link" aria-label="Direct link to Data Engineering: Lakehouses, Spark, and Notebooks" title="Direct link to Data Engineering: Lakehouses, Spark, and Notebooks" translate="no">​</a></h2>
<p>Fabric's data engineering experience is organized around the <strong>Lakehouse</strong> — a storage construct that combines the flexibility of a data lake with the query capabilities of a data warehouse.</p>
<p>When you create a Lakehouse, you get a two-zone structure:</p>
<ul>
<li class="">A <strong>Files area</strong> for raw, unstructured, or semi-structured data (CSV, JSON, images, logs)</li>
<li class="">A <strong>Tables area</strong> where data is stored as managed Delta tables, immediately queryable by SQL, Spark, and Power BI</li>
</ul>
<p>For transformation workloads, Fabric provides a fully managed <strong>Apache Spark</strong> environment. You write notebooks in Python, Scala, SQL, or R. Clusters are serverless by default — they start on demand, require no configuration, and shut down automatically when idle.</p>
<blockquote>
<p>📖 Read more: <a href="https://learn.microsoft.com/en-us/fabric/data-engineering/spark-overview" target="_blank" rel="noopener noreferrer" class="">Apache Spark in Microsoft Fabric</a></p>
</blockquote>
<p><img decoding="async" loading="lazy" alt="Spark notebook running in Fabric with Python code and Delta table output" src="https://www.recodehive.com/assets/images/fabric-spark-notebook-cf5abf219a65d19a48cf171ace72864d.png" width="1852" height="898" class="img_ev3q">
<em>📸 Screenshot: A Spark notebook from my Fabric workspace — reading raw CSV from the Files zone, writing a clean Delta table to Tables</em></p>
<p>Coming from standalone Databricks, the Spark notebook experience in Fabric felt noticeably lighter to set up. No cluster configuration, no runtime version juggling, you open a notebook and it just works.</p>
<p>For production workloads, you can promote notebooks to <strong>Spark Job Definitions</strong> for scheduled execution, and manage library dependencies using <strong>Environments</strong>, versioned, shareable Spark configurations that eliminate the classic "works on my cluster" problem.</p>
<blockquote>
<p>📖 Read more: <a href="https://learn.microsoft.com/en-us/fabric/data-engineering/lakehouse-overview" target="_blank" rel="noopener noreferrer" class="">Fabric Lakehouse overview</a></p>
</blockquote>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="data-ingestion-and-orchestration-data-factory">Data Ingestion and Orchestration: Data Factory<a href="https://www.recodehive.com/blog/microsoft-fabric-explained#data-ingestion-and-orchestration-data-factory" class="hash-link" aria-label="Direct link to Data Ingestion and Orchestration: Data Factory" title="Direct link to Data Ingestion and Orchestration: Data Factory" translate="no">​</a></h2>
<p>Getting data from external systems into the Lakehouse is the job of <strong>Data Factory</strong>, Fabric's data integration and orchestration layer.</p>
<p>Data Factory offers two primary patterns:</p>
<p><strong>Pipelines</strong> - The activity-based orchestration tool, familiar to anyone who has used Azure Data Factory or Apache Airflow. You build directed acyclic graphs of copy activities, transformation steps, conditional logic, and triggers. Fabric pipelines support hundreds of connectors to external databases, REST APIs, cloud storage, and SaaS applications.</p>
<p><strong>Dataflows Gen2</strong> - A code-free alternative using a visual, Power Query-based interface. Transformations compile to Spark or SQL execution under the hood, a practical option for analysts who need to express transformation logic without writing code.</p>
<p><img decoding="async" loading="lazy" alt="Data Factory pipeline canvas in Fabric showing a multi-step ingestion pipeline" src="https://www.recodehive.com/assets/images/fabric-pipeline-c5ecc4ab28753417b0bc9d9922cbafa5.png" width="1533" height="502" class="img_ev3q">
<em>📸 Screenshot: A pipeline from my Fabric workspace ingesting from a REST API into the Lakehouse — configured entirely within Fabric, no external ADF instance needed</em></p>
<p>One thing I genuinely appreciated: neither pipelines nor dataflows require a separate connection configuration to reach your Lakehouse because it's already in the same workspace. You select it from a dropdown. Small thing, big time saver when you're building pipelines daily.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="sql-analytics-the-data-warehouse">SQL Analytics: The Data Warehouse<a href="https://www.recodehive.com/blog/microsoft-fabric-explained#sql-analytics-the-data-warehouse" class="hash-link" aria-label="Direct link to SQL Analytics: The Data Warehouse" title="Direct link to SQL Analytics: The Data Warehouse" translate="no">​</a></h2>
<p>Fabric's <strong>Data Warehouse</strong> is a fully managed T-SQL analytics engine, but with an important architectural distinction. It stores its data in Delta Parquet on OneLake, not in a proprietary internal format.</p>
<p>This means tables written by your Spark notebooks in the Lakehouse are directly readable by warehouse SQL queries and warehouse tables are readable by Spark without any copy or ETL step in between.</p>
<p><strong>A practical decision guide:</strong></p>
<table><thead><tr><th>Use the Lakehouse when...</th><th>Use the Warehouse when...</th></tr></thead><tbody><tr><td>Workloads are Spark-heavy</td><td>Consumers are SQL analysts</td></tr><tr><td>Data is schema-flexible</td><td>Structured, governed tables are needed</td></tr><tr><td>Programmatic transformation logic is required</td><td>Strong query performance with SQL semantics is the priority</td></tr></tbody></table>
<p><img decoding="async" loading="lazy" alt="Fabric SQL Warehouse query editor" src="https://www.recodehive.com/assets/images/fabric-warehouse-sql-31b56186a2f24e4cb8066f3843804765.png" width="1767" height="735" class="img_ev3q">
<em>📸 Screenshot: Querying a Lakehouse Delta table directly from the Fabric Warehouse SQL editor — no data copy needed</em></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="real-time-intelligence-streaming-and-event-data">Real-Time Intelligence: Streaming and Event Data<a href="https://www.recodehive.com/blog/microsoft-fabric-explained#real-time-intelligence-streaming-and-event-data" class="hash-link" aria-label="Direct link to Real-Time Intelligence: Streaming and Event Data" title="Direct link to Real-Time Intelligence: Streaming and Event Data" translate="no">​</a></h2>
<p><strong>Real-Time Intelligence</strong> is Fabric's answer to streaming workloads and one of the more complete streaming experiences available within a unified platform.</p>
<p><strong>Eventstreams</strong> act as a managed event streaming layer. You connect to sources like Azure Event Hubs, Kafka, or IoT Hub, apply in-flight transformations using a visual stream-processing editor, and route output to multiple destinations simultaneously.</p>
<p>The destination for high-frequency event data is typically an <strong>Eventhouse</strong>, which contains one or more <strong>KQL databases</strong>. KQL (Kusto Query Language) is optimized for time-series and log data significantly faster than SQL for streaming analytics queries like "show me anomalies in sensor readings in the last 15 minutes, grouped by device."</p>
<p>Crucially, Eventhouse data also lives in OneLake meaning historical event data can be joined with batch data from the Lakehouse or Warehouse without a separate data movement step.</p>
<blockquote>
<p>📖 Read more: <a href="https://learn.microsoft.com/en-us/fabric/real-time-intelligence/overview" target="_blank" rel="noopener noreferrer" class="">Real-Time Intelligence in Microsoft Fabric</a></p>
</blockquote>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="data-science-and-machine-learning">Data Science and Machine Learning<a href="https://www.recodehive.com/blog/microsoft-fabric-explained#data-science-and-machine-learning" class="hash-link" aria-label="Direct link to Data Science and Machine Learning" title="Direct link to Data Science and Machine Learning" translate="no">​</a></h2>
<p>Fabric's <strong>Data Science</strong> experience covers the full ML lifecycle — from exploratory analysis through model training, evaluation, and deployment.</p>
<p>The primary workspace is Jupyter-style notebooks backed by managed Spark, with access to the full Python ML ecosystem (scikit-learn, XGBoost, PyTorch, TensorFlow) and <strong>SynapseML</strong> for distributed ML on Spark.</p>
<p>Fabric integrates <strong>MLflow</strong> natively for experiment tracking and model registration. Models can be used for batch scoring directly against Lakehouse tables using the <code>PREDICT</code> function in Spark SQL — no separate serving infrastructure required for batch inference.</p>
<p>The deeper value: feature tables built by data engineers in the Lakehouse are immediately accessible in ML notebooks without copying or re-ingesting data. The gap between data engineering and data science shrinks considerably when both are working against the same underlying tables.</p>
<blockquote>
<p>📖 Read more: <a href="https://learn.microsoft.com/en-us/fabric/data-science/data-science-overview" target="_blank" rel="noopener noreferrer" class="">Data Science in Microsoft Fabric</a></p>
</blockquote>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="security-and-governance-built-in">Security and Governance: Built In<a href="https://www.recodehive.com/blog/microsoft-fabric-explained#security-and-governance-built-in" class="hash-link" aria-label="Direct link to Security and Governance: Built In" title="Direct link to Security and Governance: Built In" translate="no">​</a></h2>
<p>One of the more understated strengths of Fabric's unified architecture is what it enables for governance. When all your data lives in one place, you define access policies once — not once per service.</p>
<p>Fabric integrates with <strong>Microsoft Entra ID</strong> for identity and access management, and with <strong>Microsoft Purview</strong> for data cataloging, lineage tracking, and sensitivity labeling. Row-level security, column-level security, and workspace-level access controls are applied uniformly across all Fabric experiences.</p>
<p>A sensitivity label applied to a table in the Lakehouse is respected when that same table is queried from the Warehouse or visualized in Power BI, a significant operational advantage over managing access policies across a fragmented stack.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="power-bi-reporting-without-data-duplication">Power BI: Reporting Without Data Duplication<a href="https://www.recodehive.com/blog/microsoft-fabric-explained#power-bi-reporting-without-data-duplication" class="hash-link" aria-label="Direct link to Power BI: Reporting Without Data Duplication" title="Direct link to Power BI: Reporting Without Data Duplication" translate="no">​</a></h2>
<p>Power BI is the reporting layer and in Fabric, it gains <strong>DirectLake mode</strong>, which addresses one of its longest-standing pain points.</p>
<p>Traditionally, Power BI reports could either:</p>
<ul>
<li class="">Query live data (slow, puts load on source systems), or</li>
<li class="">Import data into an in-memory model (fast, but creates a stale copy requiring scheduled refreshes)</li>
</ul>
<p>DirectLake is a third mode - it reads directly from Delta Parquet files in OneLake at query time, delivering import-speed performance without maintaining a separate copy of the data.</p>
<p>For data engineers, this changes everything. Once your pipeline writes a clean Delta table to the Lakehouse, a Power BI report can query it in DirectLake mode immediately, no refresh schedule, no import process, no synchronization lag.</p>
<p><img decoding="async" loading="lazy" alt="Power BI report connected to Fabric Lakehouse in DirectLake mode" src="https://www.recodehive.com/assets/images/fabric-directlake-powerbi-d9d3be0234979fbf132feae773f9ef36.png" width="3706" height="1840" class="img_ev3q">
<em>📸 Screenshot: A Power BI report in DirectLake mode querying my Fabric Lakehouse — always current as of the last pipeline run</em></p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="bringing-it-all-together">Bringing It All Together<a href="https://www.recodehive.com/blog/microsoft-fabric-explained#bringing-it-all-together" class="hash-link" aria-label="Direct link to Bringing It All Together" title="Direct link to Bringing It All Together" translate="no">​</a></h2>
<p>The reason Fabric is worth serious evaluation is not any individual component — it's what the unified architecture enables across all of them.</p>
<p>A pipeline in Data Factory writes to a Lakehouse → A Spark notebook transforms it into a clean Delta table → A data scientist trains a model against that table → A warehouse analyst queries it in SQL → A Power BI report visualizes it in DirectLake mode → An Eventstream feeds real-time data into the same Lakehouse alongside batch data. Throughout all of this, Purview tracks lineage and Entra enforces access policies.</p>
<p>None of these steps require a separate connector, a data copy, or a cross-service authentication configuration. They are all reading from OneLake.</p>
<p>For teams that have spent years managing the operational overhead of a fragmented data stack, that's a genuinely meaningful shift, one where the platform handles the integration, and engineers can focus on the work that actually matters.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="try-it-yourself">Try It Yourself<a href="https://www.recodehive.com/blog/microsoft-fabric-explained#try-it-yourself" class="hash-link" aria-label="Direct link to Try It Yourself" title="Direct link to Try It Yourself" translate="no">​</a></h2>
<ul>
<li class=""><strong>Microsoft Fabric Free Trial</strong> → <a href="https://app.fabric.microsoft.com/" target="_blank" rel="noopener noreferrer" class="">app.fabric.microsoft.com</a></li>
<li class=""><strong>Full Documentation</strong> → <a href="https://learn.microsoft.com/fabric" target="_blank" rel="noopener noreferrer" class="">learn.microsoft.com/fabric</a></li>
<li class=""><strong>OneLake Documentation</strong> → <a href="https://learn.microsoft.com/en-us/fabric/onelake/onelake-overview" target="_blank" rel="noopener noreferrer" class="">What is OneLake?</a></li>
<li class=""><strong>Apache Spark in Fabric</strong> → <a href="https://learn.microsoft.com/en-us/fabric/data-engineering/spark-overview" target="_blank" rel="noopener noreferrer" class="">Spark overview</a></li>
<li class=""><strong>Real-Time Intelligence</strong> → <a href="https://learn.microsoft.com/en-us/fabric/real-time-intelligence/overview" target="_blank" rel="noopener noreferrer" class="">RTI overview</a></li>
<li class=""><strong>Data Science in Fabric</strong> → <a href="https://learn.microsoft.com/en-us/fabric/data-science/data-science-overview" target="_blank" rel="noopener noreferrer" class="">Data Science overview</a></li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="about-the-author">About the Author<a href="https://www.recodehive.com/blog/microsoft-fabric-explained#about-the-author" class="hash-link" aria-label="Direct link to About the Author" title="Direct link to About the Author" translate="no">​</a></h2>
<p>I'm <strong>Aditya Singh Rathore</strong>, a Data Engineer passionate about building modern, scalable data platforms. I write about Microsoft Fabric, Azure data tools, and real-world data engineering on <a href="https://www.recodehive.com/" target="_blank" rel="noopener noreferrer" class="">RecodeHive</a>,breaking down complex concepts into practical, actionable content.</p>
<p>If this article helped you understand Microsoft Fabric better, consider sharing it with your network. And if you're building something with Fabric or just getting started, I'd love to hear about it.</p>
<p>🔗 Connect with me on <a href="https://www.linkedin.com/in/aditya-singh-rathore0017/" target="_blank" rel="noopener noreferrer" class="">LinkedIn</a> | <a href="https://github.com/Adez017" target="_blank" rel="noopener noreferrer" class="">GitHub</a></p>
<p>📩 Have a topic you'd like me to cover? Drop it in the comments below.</p>
<div></div>]]></content:encoded>
            <author>rathoreadityasingh30@gmail.com (Aditya Singh Rathore)</author>
            <category>microsoft-fabric</category>
            <category>onelake</category>
            <category>data-engineering</category>
            <category>lakehouse</category>
            <category>delta-lake</category>
            <category>big-data</category>
            <category>cloud</category>
            <category>power-bi</category>
        </item>
        <item>
            <title><![CDATA[OpenAI AgentKit: Building AI Agents Without the Complexity]]></title>
            <link>https://www.recodehive.com/blog/open-ai-agent-builder</link>
            <guid>https://www.recodehive.com/blog/open-ai-agent-builder</guid>
            <pubDate>Wed, 15 Oct 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[OpenAI's AgentKit revolutionizes how developers build AI agents with its visual Agent Builder, integrated ChatKit, comprehensive evaluation tools, and seamless third-party integrations. Learn how this complete toolkit takes agents from prototype to production with minimal friction.]]></description>
            <content:encoded><![CDATA[<p>Hey there, AI builders! 👋</p>
<p>I still remember the days when building an AI agent meant wrestling with fragmented tools, managing complex API calls, debugging mysterious failures, and spending more time on infrastructure than actual innovation. It felt like trying to build a house while simultaneously manufacturing your own bricks.</p>
<p>That changed on October 6, 2025, when Sam Altman took the stage at OpenAI's Dev Day and unveiled AgentKit - a complete toolkit that promises to transform how we build, deploy, and optimize AI agents. Today, I want to walk you through what makes AgentKit special and why it might be the most significant developer tool launch from OpenAI yet.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-is-agentkit">What is AgentKit?<a href="https://www.recodehive.com/blog/open-ai-agent-builder#what-is-agentkit" class="hash-link" aria-label="Direct link to What is AgentKit?" title="Direct link to What is AgentKit?" translate="no">​</a></h2>
<p><a href="https://openai.com/index/introducing-agentkit/" target="_blank" rel="noopener noreferrer" class=""><strong>AgentKit</strong></a> is described by OpenAI CEO Sam Altman as a comprehensive set of building blocks designed to help developers take agents from prototype to production. But that simple description doesn't do it justice.</p>
<p>Think of AgentKit as the unified development platform that the AI agent ecosystem has been desperately needing. Instead of piecing together multiple tools, APIs, and services from different providers, you get everything in one coherent package that actually works together.</p>
<p>The promise? Build, deploy, and optimize agent workflows with significantly less friction.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-agentkit-matters-now">Why AgentKit Matters Now<a href="https://www.recodehive.com/blog/open-ai-agent-builder#why-agentkit-matters-now" class="hash-link" aria-label="Direct link to Why AgentKit Matters Now" title="Direct link to Why AgentKit Matters Now" translate="no">​</a></h2>
<p>Before we dive into the components, let's talk about timing. OpenAI's ChatGPT has reached 800 million weekly active users, making it one of the most widely used AI platforms in history. This massive user base represents an equally massive opportunity for developers to build AI-powered solutions.</p>
<p>The launch signals OpenAI's competitive move against other AI platforms racing to offer integrated tools for building autonomous agents that can perform complex tasks, not just respond to prompts. We're witnessing the shift from conversational AI to truly agentic AI - systems that can take action, use tools, and accomplish multi-step goals autonomously.</p>
<p><img decoding="async" loading="lazy" alt="A demo image showing agentkit interface" src="https://www.recodehive.com/assets/images/Agent_interface-5922eb54b63782bed24cf7563a227f48.png" width="1920" height="1080" class="img_ev3q"></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-four-pillars-of-agentkit">The Four Pillars of AgentKit<a href="https://www.recodehive.com/blog/open-ai-agent-builder#the-four-pillars-of-agentkit" class="hash-link" aria-label="Direct link to The Four Pillars of AgentKit" title="Direct link to The Four Pillars of AgentKit" translate="no">​</a></h2>
<p>AgentKit isn't just one tool - it's a complete ecosystem built around four core capabilities. Let's explore each one and understand how they work together.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="1-agent-builder-the-visual-workflow-editor">1. Agent Builder: The Visual Workflow Editor<a href="https://www.recodehive.com/blog/open-ai-agent-builder#1-agent-builder-the-visual-workflow-editor" class="hash-link" aria-label="Direct link to 1. Agent Builder: The Visual Workflow Editor" title="Direct link to 1. Agent Builder: The Visual Workflow Editor" translate="no">​</a></h3>
<p>Altman described Agent Builder as "like Canva for building agents" - a fast, visual way to design the logic, steps, and ideas.</p>
<p>This is the headline feature that's getting everyone excited, and for good reason. Remember when website builders transformed from hand-coding HTML to drag-and-drop interfaces? Agent Builder does the same thing for AI agent development.</p>
<p><strong>What Agent Builder Does:</strong></p>
<ul>
<li class="">Provides a visual canvas for designing agent workflows</li>
<li class="">Uses drag-and-drop components to define agent logic</li>
<li class="">Built on top of the Responses API that hundreds of thousands of developers already use</li>
<li class="">Eliminates the need to write boilerplate code for common agent patterns</li>
</ul>
<p><strong>Why This Matters:</strong>
Here's the thing - even experienced developers spend a disproportionate amount of time on scaffolding and infrastructure when building agents. Agent Builder abstracts away the repetitive parts while still giving you control over the important decisions.</p>
<p><strong>The Power of Visual Design:</strong>
When you can see your agent's workflow as a visual graph, you can:</p>
<ul>
<li class="">Spot logical errors before they become runtime bugs</li>
<li class="">Understand complex conditional flows at a glance</li>
<li class="">Iterate faster by rearranging components visually</li>
<li class="">Collaborate with non-technical stakeholders who can understand the visual representation</li>
</ul>
<p>Think of it this way: If traditional agent development is like writing assembly code, Agent Builder is like using a modern IDE with IntelliSense, debugger, and visual tools all built in.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="2-chatkit-embeddable-chat-interfaces-made-simple">2. ChatKit: Embeddable Chat Interfaces Made Simple<a href="https://www.recodehive.com/blog/open-ai-agent-builder#2-chatkit-embeddable-chat-interfaces-made-simple" class="hash-link" aria-label="Direct link to 2. ChatKit: Embeddable Chat Interfaces Made Simple" title="Direct link to 2. ChatKit: Embeddable Chat Interfaces Made Simple" translate="no">​</a></h3>
<p>The second pillar of AgentKit is ChatKit - and this is where things get really practical for product builders.</p>
<p><strong>What ChatKit Provides:</strong>
A simple embeddable chat interface that developers can use to bring chat experiences into their own apps, with the ability to bring your own brand, workflows, and whatever makes your product unique.</p>
<p><strong>Why ChatKit Is Brilliant:</strong>
Building a good chat interface is harder than it looks. You need to handle:</p>
<ul>
<li class="">Message threading and history</li>
<li class="">Streaming responses for better UX</li>
<li class="">Error handling and retry logic</li>
<li class="">Mobile responsiveness</li>
<li class="">Accessibility features</li>
<li class="">Loading states and animations</li>
</ul>
<p>ChatKit handles all of this out of the box, but here's the clever part - it's not a black box. You can customize it to match your brand, inject your own business logic, and integrate it seamlessly into existing applications.</p>
<p>The beauty is that you're not starting from scratch. You're building on a foundation that's been battle-tested by millions of users in ChatGPT.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="3-evals-for-agents-measuring-what-matters">3. Evals for Agents: Measuring What Matters<a href="https://www.recodehive.com/blog/open-ai-agent-builder#3-evals-for-agents-measuring-what-matters" class="hash-link" aria-label="Direct link to 3. Evals for Agents: Measuring What Matters" title="Direct link to 3. Evals for Agents: Measuring What Matters" translate="no">​</a></h3>
<p>This is where AgentKit gets serious about production deployments. Anyone can build a demo that works once. Building something reliable enough to bet your business on requires rigorous evaluation.</p>
<p><strong>What Evals for Agents Includes:</strong>
Tools to measure AI agent performance, including step-by-step trace grading, datasets for assessing individual agent components, automated prompt optimization, and the ability to run evaluations on external models.</p>
<p><strong>The Evaluation Challenge:</strong>
Here's what makes evaluating AI agents tricky:</p>
<ul>
<li class="">Unlike traditional software, agents are probabilistic - they might behave differently each time</li>
<li class="">Success isn't binary - there are degrees of correctness</li>
<li class="">Complex workflows have multiple failure points</li>
<li class="">Optimization in one area might break something else</li>
</ul>
<p><strong>How Evals for Agents Solves This:</strong></p>
<p><strong>Step-by-Step Trace Grading:</strong>
Instead of just looking at final outputs, you can evaluate each step in your agent's reasoning process. This is game-changing for debugging. When something goes wrong, you can pinpoint exactly which step failed and why.</p>
<p><strong>Component-Level Datasets:</strong>
You can create evaluation datasets for individual components of your agent. This modular approach means you can improve specific parts without worrying about breaking the whole system.</p>
<p><strong>Automated Prompt Optimization:</strong>
Prompt engineering is more art than science, but it doesn't have to be. With automated optimization, you can test variations systematically and let data drive your decisions.</p>
<p><strong>Cross-Model Evaluation:</strong>
The ability to run evaluations on external models directly from the OpenAI platform is subtle but powerful. It means you can compare performance across different models, optimize for cost vs. quality, and make informed decisions about model selection.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="4-connector-registry-secure-integration-at-scale">4. Connector Registry: Secure Integration at Scale<a href="https://www.recodehive.com/blog/open-ai-agent-builder#4-connector-registry-secure-integration-at-scale" class="hash-link" aria-label="Direct link to 4. Connector Registry: Secure Integration at Scale" title="Direct link to 4. Connector Registry: Secure Integration at Scale" translate="no">​</a></h3>
<p>The fourth pillar ties everything together by solving one of the thorniest problems in enterprise AI: secure, controlled access to internal tools and external services.</p>
<p><strong>What the Connector Registry Provides:</strong>
Developers can securely connect agents to internal tools and third-party systems through an admin control panel while maintaining security and control.</p>
<p><strong>Why This Matters for Enterprises:</strong>
When I talk to enterprise developers, the same concerns come up repeatedly:</p>
<ul>
<li class="">How do we give AI agents access to our systems without compromising security?</li>
<li class="">How do we audit what agents are doing with sensitive data?</li>
<li class="">How do we revoke access quickly if needed?</li>
<li class="">How do we comply with regulatory requirements?</li>
</ul>
<p>The Connector Registry addresses all of these with a centralized, controlled approach to integrations.</p>
<p><strong>The Security Model:</strong></p>
<ul>
<li class="">Centralized admin control panel for managing all connections</li>
<li class="">Granular permissions at the agent and tool level</li>
<li class="">Audit logs for compliance and debugging</li>
<li class="">Easy revocation and rotation of credentials</li>
<li class="">Support for OAuth and other enterprise authentication methods</li>
</ul>
<p><strong>The Developer Experience:</strong>
For developers, it's beautifully simple. Instead of managing API keys in environment variables and writing custom integration code, you:</p>
<ol>
<li class="">Select the connector you need from the registry</li>
<li class="">Authenticate through the admin panel</li>
<li class="">Use it in your agent with a simple reference</li>
</ol>
<p>The platform handles the rest - credential management, retries, rate limiting, and error handling.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="seeing-is-believing-the-live-demo">Seeing Is Believing: The Live Demo<a href="https://www.recodehive.com/blog/open-ai-agent-builder#seeing-is-believing-the-live-demo" class="hash-link" aria-label="Direct link to Seeing Is Believing: The Live Demo" title="Direct link to Seeing Is Believing: The Live Demo" translate="no">​</a></h2>
<p>One of the most compelling moments from Dev Day was when OpenAI engineer Christina Huang built an entire AI workflow and two AI agents live onstage in under eight minutes.</p>
<p>Let me repeat that: <strong>under eight minutes</strong>. From zero to a working multi-agent system.</p>
<p>This wasn't a pre-recorded demo with everything perfectly set up. This was live, unscripted development that showed what's possible when you remove unnecessary friction from the development process.</p>
<p>What would that same task have taken before AgentKit? Probably hours of coding, debugging, and testing. And that's if you're an experienced AI developer who knows all the APIs and best practices.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="how-the-components-work-together">How the Components Work Together<a href="https://www.recodehive.com/blog/open-ai-agent-builder#how-the-components-work-together" class="hash-link" aria-label="Direct link to How the Components Work Together" title="Direct link to How the Components Work Together" translate="no">​</a></h2>
<p>Now that we've covered the four pillars individually, let's see how they create a unified development experience:</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-development-flow">The Development Flow<a href="https://www.recodehive.com/blog/open-ai-agent-builder#the-development-flow" class="hash-link" aria-label="Direct link to The Development Flow" title="Direct link to The Development Flow" translate="no">​</a></h3>
<p><strong>Step 1: Design Your Agent</strong>
Start in Agent Builder, visually mapping out your agent's workflow. Define the steps, decision points, and tool usage without writing any code.</p>
<p><strong>Step 2: Connect Your Tools</strong>
Use the Connector Registry to securely link your agent to the services it needs - databases, APIs, internal tools, whatever your use case requires.</p>
<p><strong>Step 3: Add the Interface</strong>
Integrate ChatKit to give your users a polished way to interact with your agent. Customize it to match your brand and product experience.</p>
<p><strong>Step 4: Evaluate and Optimize</strong>
Use Evals for Agents to measure performance, identify weaknesses, and systematically improve your agent's reliability.</p>
<p><strong>Step 5: Deploy and Monitor</strong>
Push to production with confidence, knowing you have the evaluation framework to catch issues and the tools to iterate quickly.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-iteration-loop">The Iteration Loop<a href="https://www.recodehive.com/blog/open-ai-agent-builder#the-iteration-loop" class="hash-link" aria-label="Direct link to The Iteration Loop" title="Direct link to The Iteration Loop" translate="no">​</a></h3>
<p>Here's where the integrated approach really shines. Traditional development has a slow feedback loop:</p>
<ol>
<li class="">Write code</li>
<li class="">Deploy to test environment</li>
<li class="">Manually test</li>
<li class="">Find bugs</li>
<li class="">Fix bugs</li>
<li class="">Repeat</li>
</ol>
<p>With AgentKit, the loop is much tighter:</p>
<ol>
<li class="">Adjust agent visually in Agent Builder</li>
<li class="">Run automated evals</li>
<li class="">See results immediately</li>
<li class="">Iterate based on data</li>
</ol>
<p>This faster iteration cycle means you can explore more possibilities, validate assumptions quickly, and get to production-ready faster.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-philosophy-behind-agentkit">The Philosophy Behind AgentKit<a href="https://www.recodehive.com/blog/open-ai-agent-builder#the-philosophy-behind-agentkit" class="hash-link" aria-label="Direct link to The Philosophy Behind AgentKit" title="Direct link to The Philosophy Behind AgentKit" translate="no">​</a></h2>
<p>Altman noted that AgentKit is "all the stuff that we wished we had when we were trying to build our first agents". This statement reveals something important about OpenAI's approach.</p>
<p>AgentKit wasn't designed in a vacuum by people who don't build with AI. It was designed by the same team that's been building ChatGPT, GPT-4, and other cutting-edge AI systems. They've felt the pain points, hit the roadblocks, and now they're sharing the solutions they wish they'd had.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="opinionated-but-flexible">Opinionated But Flexible<a href="https://www.recodehive.com/blog/open-ai-agent-builder#opinionated-but-flexible" class="hash-link" aria-label="Direct link to Opinionated But Flexible" title="Direct link to Opinionated But Flexible" translate="no">​</a></h3>
<p>AgentKit makes strong opinions about the right way to build agents:</p>
<ul>
<li class="">Visual design over code-first approaches</li>
<li class="">Evaluation-driven development over manual testing</li>
<li class="">Secure, centralized integrations over scattered API keys</li>
<li class="">Component reusability over monolithic builds</li>
</ul>
<p>But these opinions don't lock you in. Agent Builder is built on top of the Responses API that hundreds of thousands of developers already use, which means you can drop down to code when you need more control.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="production-ready-from-day-one">Production-Ready from Day One<a href="https://www.recodehive.com/blog/open-ai-agent-builder#production-ready-from-day-one" class="hash-link" aria-label="Direct link to Production-Ready from Day One" title="Direct link to Production-Ready from Day One" translate="no">​</a></h3>
<p>Many developer tools focus on getting you to "hello world" quickly but leave you on your own for production concerns. AgentKit takes the opposite approach - it's designed for production from the start.</p>
<p>The inclusion of Evals, the Connector Registry with admin controls, and the focus on security and reliability all signal that this isn't a toy for prototypes. It's infrastructure for building real businesses on.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="who-benefits-most-from-agentkit">Who Benefits Most from AgentKit?<a href="https://www.recodehive.com/blog/open-ai-agent-builder#who-benefits-most-from-agentkit" class="hash-link" aria-label="Direct link to Who Benefits Most from AgentKit?" title="Direct link to Who Benefits Most from AgentKit?" translate="no">​</a></h2>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="individual-developers">Individual Developers<a href="https://www.recodehive.com/blog/open-ai-agent-builder#individual-developers" class="hash-link" aria-label="Direct link to Individual Developers" title="Direct link to Individual Developers" translate="no">​</a></h3>
<p>If you're a solo developer with an idea for an AI-powered product, AgentKit dramatically lowers the barrier to entry. You don't need a team of ML engineers and DevOps specialists. You can build, evaluate, and deploy agents yourself.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="startups">Startups<a href="https://www.recodehive.com/blog/open-ai-agent-builder#startups" class="hash-link" aria-label="Direct link to Startups" title="Direct link to Startups" translate="no">​</a></h3>
<p>For startups, AgentKit means faster time to market and lower development costs. Instead of spending months on infrastructure, you can focus on your unique value proposition and get to product-market fit faster.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="enterprise-teams">Enterprise Teams<a href="https://www.recodehive.com/blog/open-ai-agent-builder#enterprise-teams" class="hash-link" aria-label="Direct link to Enterprise Teams" title="Direct link to Enterprise Teams" translate="no">​</a></h3>
<p>OpenAI has already signed on several launch partners that have scaled agents using AgentKit. For enterprises, the value is in the security model, evaluation framework, and ability to standardize on a single platform across teams.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="non-technical-founders">Non-Technical Founders<a href="https://www.recodehive.com/blog/open-ai-agent-builder#non-technical-founders" class="hash-link" aria-label="Direct link to Non-Technical Founders" title="Direct link to Non-Technical Founders" translate="no">​</a></h3>
<p>Here's a bold prediction: AgentKit will enable non-technical founders to build AI products that would have previously required a technical co-founder. The visual nature of Agent Builder, combined with the pre-built components, puts agent development within reach of anyone willing to learn.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-competitive-landscape">The Competitive Landscape<a href="https://www.recodehive.com/blog/open-ai-agent-builder#the-competitive-landscape" class="hash-link" aria-label="Direct link to The Competitive Landscape" title="Direct link to The Competitive Landscape" translate="no">​</a></h2>
<p>The launch highlights OpenAI's push to increase developer adoption by making agent building faster and easier, and signals a competitive move against other AI platforms racing to offer integrated tools.</p>
<p>The AI infrastructure space is heating up, with players like:</p>
<ul>
<li class="">LangChain providing agent frameworks</li>
<li class="">AutoGen offering multi-agent systems</li>
<li class="">Anthropic's Claude with computer use</li>
<li class="">Numerous startups building agent platforms</li>
</ul>
<p>What makes AgentKit different is the integration. While other tools focus on one piece of the puzzle, AgentKit provides the whole solution - from design to deployment to evaluation.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="best-practices-for-building-with-agentkit">Best Practices for Building with AgentKit<a href="https://www.recodehive.com/blog/open-ai-agent-builder#best-practices-for-building-with-agentkit" class="hash-link" aria-label="Direct link to Best Practices for Building with AgentKit" title="Direct link to Best Practices for Building with AgentKit" translate="no">​</a></h2>
<p>Based on what we know about AgentKit and agent development in general, here are some principles to keep in mind:</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="start-simple-then-expand">Start Simple, Then Expand<a href="https://www.recodehive.com/blog/open-ai-agent-builder#start-simple-then-expand" class="hash-link" aria-label="Direct link to Start Simple, Then Expand" title="Direct link to Start Simple, Then Expand" translate="no">​</a></h3>
<p>Don't try to build a complex multi-agent system on day one. Start with a single, focused agent that does one thing well. Use Evals to make sure it's reliable, then add complexity gradually.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="evaluation-driven-development">Evaluation-Driven Development<a href="https://www.recodehive.com/blog/open-ai-agent-builder#evaluation-driven-development" class="hash-link" aria-label="Direct link to Evaluation-Driven Development" title="Direct link to Evaluation-Driven Development" translate="no">​</a></h3>
<p>Make evaluation a first-class part of your development process. Create eval datasets before you build, not after. This forces you to think clearly about what success looks like.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="embrace-the-visual-paradigm">Embrace the Visual Paradigm<a href="https://www.recodehive.com/blog/open-ai-agent-builder#embrace-the-visual-paradigm" class="hash-link" aria-label="Direct link to Embrace the Visual Paradigm" title="Direct link to Embrace the Visual Paradigm" translate="no">​</a></h3>
<p>If you're a code-first developer, give the visual builder a real chance. It might feel awkward at first, but the benefits of being able to see your agent's logic at a glance are substantial.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="security-first">Security First<a href="https://www.recodehive.com/blog/open-ai-agent-builder#security-first" class="hash-link" aria-label="Direct link to Security First" title="Direct link to Security First" translate="no">​</a></h3>
<p>Use the Connector Registry's admin controls from the start. Don't cut corners on security even in development. It's much harder to add security later than to build it in from the beginning.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="iterate-based-on-real-usage">Iterate Based on Real Usage<a href="https://www.recodehive.com/blog/open-ai-agent-builder#iterate-based-on-real-usage" class="hash-link" aria-label="Direct link to Iterate Based on Real Usage" title="Direct link to Iterate Based on Real Usage" translate="no">​</a></h3>
<p>Deploy early (to a small audience) and let real usage guide your improvements. The evaluation tools will help you identify where your agent is struggling with actual user queries.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-future-of-agent-development">The Future of Agent Development<a href="https://www.recodehive.com/blog/open-ai-agent-builder#the-future-of-agent-development" class="hash-link" aria-label="Direct link to The Future of Agent Development" title="Direct link to The Future of Agent Development" translate="no">​</a></h2>
<p>AgentKit represents a bet on the future of software development. OpenAI is betting that:</p>
<ol>
<li class=""><strong>Agents will be everywhere</strong> - Not just chatbots, but agents handling complex workflows across industries</li>
<li class=""><strong>Visual tools will dominate</strong> - The future of development is more visual, more accessible, and less code-heavy</li>
<li class=""><strong>Evaluation matters</strong> - As agents become critical infrastructure, systematic evaluation becomes non-negotiable</li>
<li class=""><strong>Integration is key</strong> - The value is in connecting AI to your existing tools and data, not just in the AI itself</li>
</ol>
<p>I think they're right on all counts.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="challenges-and-considerations">Challenges and Considerations<a href="https://www.recodehive.com/blog/open-ai-agent-builder#challenges-and-considerations" class="hash-link" aria-label="Direct link to Challenges and Considerations" title="Direct link to Challenges and Considerations" translate="no">​</a></h2>
<p>Of course, no tool is perfect. Here are some things to keep in mind:</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="vendor-lock-in">Vendor Lock-In<a href="https://www.recodehive.com/blog/open-ai-agent-builder#vendor-lock-in" class="hash-link" aria-label="Direct link to Vendor Lock-In" title="Direct link to Vendor Lock-In" translate="no">​</a></h3>
<p>Building on AgentKit means building on OpenAI's platform. While you can run evaluations on external models, you're still deeply integrated with OpenAI's ecosystem. Make sure you're comfortable with that dependency.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="learning-curve">Learning Curve<a href="https://www.recodehive.com/blog/open-ai-agent-builder#learning-curve" class="hash-link" aria-label="Direct link to Learning Curve" title="Direct link to Learning Curve" translate="no">​</a></h3>
<p>While AgentKit aims to make agent development easier, there's still a learning curve. Understanding how to design effective agent workflows, write good evaluation criteria, and optimize for production takes time and practice.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="cost-considerations">Cost Considerations<a href="https://www.recodehive.com/blog/open-ai-agent-builder#cost-considerations" class="hash-link" aria-label="Direct link to Cost Considerations" title="Direct link to Cost Considerations" translate="no">​</a></h3>
<p>Using AI at scale isn't free. Make sure you understand the pricing model and factor in API costs when planning your application.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="limits-of-automation">Limits of Automation<a href="https://www.recodehive.com/blog/open-ai-agent-builder#limits-of-automation" class="hash-link" aria-label="Direct link to Limits of Automation" title="Direct link to Limits of Automation" translate="no">​</a></h3>
<p>Agent Builder is powerful, but it can't replace deep thinking about your problem domain. You still need to understand your users, design good workflows, and make strategic decisions.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="getting-started">Getting Started<a href="https://www.recodehive.com/blog/open-ai-agent-builder#getting-started" class="hash-link" aria-label="Direct link to Getting Started" title="Direct link to Getting Started" translate="no">​</a></h2>
<p>Ready to dive in? Here's how to get started with AgentKit:</p>
<ol>
<li class="">
<p><strong>Explore the Documentation</strong> - <a href="https://openai.com/index/introducing-agentkit/" target="_blank" rel="noopener noreferrer" class="">OpenAI's documentation</a> is comprehensive and includes tutorials for common use cases</p>
</li>
<li class="">
<p><strong>Start with Templates</strong> - Don't build from scratch if you don't have to. Start with templates and modify them for your needs</p>
</li>
<li class="">
<p><strong>Join the Community</strong> - Connect with other developers building with AgentKit. Share patterns, ask questions, and learn from others here : <a href="https://community.openai.com/" target="_blank" rel="noopener noreferrer" class="">https://community.openai.com/</a></p>
</li>
<li class="">
<p><strong>Build in Public</strong> - Share your progress and learnings. The community grows stronger when we share knowledge</p>
</li>
</ol>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="conclusion-the-agent-era-begins">Conclusion: The Agent Era Begins<a href="https://www.recodehive.com/blog/open-ai-agent-builder#conclusion-the-agent-era-begins" class="hash-link" aria-label="Direct link to Conclusion: The Agent Era Begins" title="Direct link to Conclusion: The Agent Era Begins" translate="no">​</a></h2>
<p>AgentKit isn't just another developer tool - it's OpenAI's vision for how AI agent development should work. By removing friction, providing integrated tools, and making evaluation a first-class concern, AgentKit makes it possible for far more people to build production-grade AI agents.</p>
<p>Altman's statement that this is "all the stuff we wished we had when we were trying to build our first agents" resonates because it comes from real experience. This isn't theoretical - it's battle-tested approaches packaged for everyone.</p>
<p>Whether you're a seasoned AI developer looking to build faster, a startup trying to find product-market fit, or an enterprise scaling AI across your organization, AgentKit provides the foundation you need.</p>
<p>The question isn't whether agents will transform how we build software - they already are. The question is whether you'll be part of that transformation. With AgentKit, the barrier to entry has never been lower.</p>
<hr>
<p><em>The future of software is agentic, and AgentKit is your toolkit for building it. The only question left is: what will you build? 🚀</em></p>
<div></div>]]></content:encoded>
            <author>rathoreadityasingh30@gmail.com (Aditya Singh Rathore)</author>
            <category>OpenAI</category>
            <category>AgentKit</category>
            <category>AI Agents</category>
            <category>Agent Builder</category>
            <category>Agentic AI</category>
            <category>Developer Tools</category>
        </item>
        <item>
            <title><![CDATA[GitHub Copilot CLI: Public Preview]]></title>
            <link>https://www.recodehive.com/blog/github-cli-agent</link>
            <guid>https://www.recodehive.com/blog/github-cli-agent</guid>
            <pubDate>Wed, 17 Sep 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[GitHub bought power of GitHub Copilot coding agent directly to your terminal, with GitHub Copilot CLI, you can work locally and synchronously with an AI agent.]]></description>
            <content:encoded><![CDATA[<p> </p>
<p>GitHub Copilot CLI is now in public preview
GitHub bought power of GitHub Copilot coding agent directly to your terminal, with <a href="https://github.com/features/copilot/cli?utm_source=changelog-amp-linkedin&amp;utm_campaign=agentic-copilot-cli-launch-2025" target="_blank" rel="noopener noreferrer" class="">GitHub Copilot CLI</a>, you can work locally and synchronously with an AI agent that understands your code and GitHub context in depth.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-overview">📖 Overview<a href="https://www.recodehive.com/blog/github-cli-agent#-overview" class="hash-link" aria-label="Direct link to 📖 Overview" title="Direct link to 📖 Overview" translate="no">​</a></h2>
<p>GitHub Copilot CLI is now in <code>public preview</code>, and it’s designed to bring AI-powered development right to your command line. You can work locally and synchronously with an AI agent that understands your code and GitHub context no IDE switching required.</p>
<p><img decoding="async" loading="lazy" alt="GitHub Copilot CLI banner and overview image" src="https://www.recodehive.com/assets/images/cover-page-2-28142b85f8fc6854e3c2feea653d841e.png" width="1438" height="738" class="img_ev3q"></p>
<p>✨<strong>Key features:</strong></p>
<ul>
<li class="">✅<strong>Terminal-native dev</strong> – Use the Copilot coding agent directly in your terminal.</li>
<li class="">✅<strong>GitHub integration</strong> – Work with repositories, issues, and pull requests using llm.</li>
<li class="">✅<strong>Agentic capabilities</strong> – Build, edit, debug, and refactor code with AI.</li>
<li class="">✅<strong>MCP-powered extensibility</strong> – Extend with <code>custom MCP servers</code>.</li>
<li class="">✅<strong>Full control</strong> – Every action requires your explicit approval.</li>
</ul>
<p>Plus, extend Copilot CLI's capabilities and context through <strong>custom MCP servers</strong>.
Agent-powered, GitHub-native
Execute coding tasks with an agent that knows your repositories, issues, and pull requests — all natively in your terminal.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-getting-started">📦 Getting Started<a href="https://www.recodehive.com/blog/github-cli-agent#-getting-started" class="hash-link" aria-label="Direct link to 📦 Getting Started" title="Direct link to 📦 Getting Started" translate="no">​</a></h2>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="supported-platforms">Supported Platforms<a href="https://www.recodehive.com/blog/github-cli-agent#supported-platforms" class="hash-link" aria-label="Direct link to Supported Platforms" title="Direct link to Supported Platforms" translate="no">​</a></h3>
<ul>
<li class="">✅Linux</li>
<li class="">✅macOS</li>
<li class="">✅Windows (experimental)</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="prerequisites">Prerequisites<a href="https://www.recodehive.com/blog/github-cli-agent#prerequisites" class="hash-link" aria-label="Direct link to Prerequisites" title="Direct link to Prerequisites" translate="no">​</a></h3>
<ul>
<li class="">⚙️Node.js <strong>v22+</strong></li>
<li class="">⚙️npm <strong>v10+</strong></li>
<li class="">⚙️PowerShell <strong>v6+</strong> (Windows only)</li>
<li class="">⚙️Active GitHub Copilot subscription (Pro, Pro+, Business, or Enterprise)</li>
</ul>
<p>You can install the latest version of the powershell using this command and check the version as mentioned above it should be more than V6.</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">winget install Microsoft.PowerShell</span><br></div></code></pre></div></div>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">pwsh --version</span><br></div></code></pre></div></div>
<p><em>If you have access to GitHub Copilot via your organization of enterprise, you cannot use GitHub Copilot CLI if your organization owner or enterprise administrator has disabled it in the organization or enterprise settings. See Managing policies and features for GitHub Copilot in your organization for more information.</em></p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-installation">💽 Installation<a href="https://www.recodehive.com/blog/github-cli-agent#-installation" class="hash-link" aria-label="Direct link to 💽 Installation" title="Direct link to 💽 Installation" translate="no">​</a></h2>
<p>Install globally with npm:
Powered by the same agentic harness as GitHub's Copilot coding agent, it provides intelligent assistance while staying deeply integrated with your GitHub workflow.
Enter the prompt in the command line.</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">npm install -g @github/copilot</span><br></div></code></pre></div></div>
<p><img decoding="async" loading="lazy" alt="Screenshot of npm install command for GitHub Copilot CLI" src="https://www.recodehive.com/assets/images/01-GitHub-CLI-start-command-8365f778dc024fea93ce73a4b4d1acba.png" width="1518" height="798" class="img_ev3q"></p>
<p>Verify installation: the below command will run the banner start image of GitHub Copilot.</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">copilot --banner</span><br></div></code></pre></div></div>
<p>Authenticate with your GitHub account:
If you're not currently logged in to GitHub, you'll be prompted to use the <code>/login</code> slash command. Enter this command and follow the on-screen instructions to authenticate.</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">/login</span><br></div></code></pre></div></div>
<p>Or authenticate using a <strong>Personal Access Token (PAT):</strong></p>
<p>You can also authenticate using a fine-graned PAT with the "Copilot Rrequests" permission enabled.
Visit <code>https://github.com/settings/personal-access-tokens/new</code>
Under <code>Permissions</code>," click add <code>permissions</code> and select <code>Copilot Requests</code>
Generate your token
Add the token to your environment via the environment variable GH_TOKEN or GITHUB_TOKEN.👇🏻</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain"># Linux/macOS</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">export GH_TOKEN=your_token_here  </span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"># Windows</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">setx GH_TOKEN your_token_here</span><br></div></code></pre></div></div>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="️-usage">🖥️ Usage<a href="https://www.recodehive.com/blog/github-cli-agent#%EF%B8%8F-usage" class="hash-link" aria-label="Direct link to 🖥️ Usage" title="Direct link to 🖥️ Usage" translate="no">​</a></h2>
<p>Once installed, run copilot on your terminal, Image of the splash screen for the Copilot CLI. The usage is pretty straight forward you can use the arrow keys to navigate to proceed cancel instruction etc.</p>
<p>Each time you submit a prompt to GitHub Copilot CLI, your monthly quota of premium requests is reduced by one. For information about premium requests,
<code>https://docs.github.com/en/copilot/concepts/billing/copilot-requests</code></p>
<p><img decoding="async" loading="lazy" alt="Splash screen of GitHub Copilot CLI showing navigation options" src="https://www.recodehive.com/assets/images/02-starting-copilot-db9e94321313621d47f828ea81de2997.png" width="1417" height="831" class="img_ev3q"></p>
<p>Launch Copilot CLI in a project folder:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">copilot</span><br></div></code></pre></div></div>
<p>By default, it runs <strong>Claude Sonnet 4</strong>. To switch to <strong>GPT-5</strong>:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain"># Linux/macOS</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">COPILOT_MODEL=gpt-5 copilot</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"># Windows</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">set COPILOT_MODEL=gpt-5</span><br></div></code></pre></div></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="version-checking-and-exit-cli">Version checking and Exit CLI<a href="https://www.recodehive.com/blog/github-cli-agent#version-checking-and-exit-cli" class="hash-link" aria-label="Direct link to Version checking and Exit CLI" title="Direct link to Version checking and Exit CLI" translate="no">​</a></h2>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">copilot --version</span><br></div></code></pre></div></div>
<p>Exit anytime with:</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">Ctrl + C (twice)</span><br></div></code></pre></div></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="get-suggestions-for-common-dev-tasks">Get Suggestions for Common Dev Tasks<a href="https://www.recodehive.com/blog/github-cli-agent#get-suggestions-for-common-dev-tasks" class="hash-link" aria-label="Direct link to Get Suggestions for Common Dev Tasks" title="Direct link to Get Suggestions for Common Dev Tasks" translate="no">​</a></h2>
<p>Now let's get started with development, here fork this repo and activate GitHub CLI and enter the below bash commands. <a href="https://github.com/recodehive/recode-website" target="_blank" rel="noopener noreferrer" class="">Website</a></p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="list-of-all-commands-in-cli">List of all commands in CLI<a href="https://www.recodehive.com/blog/github-cli-agent#list-of-all-commands-in-cli" class="hash-link" aria-label="Direct link to List of all commands in CLI" title="Direct link to List of all commands in CLI" translate="no">​</a></h3>
<p>I have linked the offical website repo to log any bugs or do direct PR. <a href="https://github.com/github/copilot-cli?utm_source=changelog-amp-linkedin&amp;utm_campaign=agentic-copilot-cli-launch-2025" target="_blank" rel="noopener noreferrer" class="">GitHub CLI repo</a> and <a href="https://docs.github.com/en/copilot/how-tos/use-copilot-agents/use-copilot-cli?utm_campaign=agentic-copilot-cli-launch-2025&amp;utm_source=changelog-amp-linkedin" target="_blank" rel="noopener noreferrer" class="">Official Documentation</a></p>
<p><code>alias</code>
<code>api</code>
<code>attestation</code>
<code>auth</code>
<code>browse</code>
<code>cache</code>
<code>co</code>
<code>codespace</code>
<code>completion</code>
<code>config</code>
<code>extension</code>
<code>gist</code>
<code>gpg-key</code>
<code>issue</code>
<code>label</code>
<code>org</code>
<code>pr</code>
<code>preview</code>
<code>project</code>
<code>release</code>
<code>repo</code>
<code> ruleset</code>
<code>run</code>
<code>search</code>
<code>secret</code>
<code>ssh-key</code>
<code>status</code>
<code>variable</code>
<code>workflow</code></p>
<p>For preview to run enter the following command. 👇🏻</p>
<p><img decoding="async" loading="lazy" alt="Example output of running GitHub Copilot CLI suggest command" src="https://www.recodehive.com/assets/images/03-try-out-the-usage-of-CLI-253df56b358da649bc61e1cd1078088f.png" width="1265" height="713" class="img_ev3q"></p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="documentation">Documentation<a href="https://www.recodehive.com/blog/github-cli-agent#documentation" class="hash-link" aria-label="Direct link to Documentation" title="Direct link to Documentation" translate="no">​</a></h3>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">gh copilot suggest "create new documentation page in docusaurus"</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">gh copilot suggest "organize documentation with sidebars"</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">gh copilot suggest "create code of conduct for repository"</span><br></div></code></pre></div></div>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="git-workflow">Git Workflow<a href="https://www.recodehive.com/blog/github-cli-agent#git-workflow" class="hash-link" aria-label="Direct link to Git Workflow" title="Direct link to Git Workflow" translate="no">​</a></h3>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">gh copilot suggest "create feature branch for new blog post"</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">gh copilot suggest "commit changes to blog content"</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">gh copilot suggest "create pull request for documentation updates"</span><br></div></code></pre></div></div>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="repository-maintenance">Repository Maintenance<a href="https://www.recodehive.com/blog/github-cli-agent#repository-maintenance" class="hash-link" aria-label="Direct link to Repository Maintenance" title="Direct link to Repository Maintenance" translate="no">​</a></h3>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">gh copilot suggest "check repository status and pending changes"</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">gh copilot suggest "merge feature branch after review"</span><br></div></code></pre></div></div>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="testing--quality">Testing &amp; Quality<a href="https://www.recodehive.com/blog/github-cli-agent#testing--quality" class="hash-link" aria-label="Direct link to Testing &amp; Quality" title="Direct link to Testing &amp; Quality" translate="no">​</a></h3>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">gh copilot suggest "run linting checks on typescript files"</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">gh copilot suggest "fix build errors in docusaurus"</span><br></div></code></pre></div></div>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="package-management">Package Management<a href="https://www.recodehive.com/blog/github-cli-agent#package-management" class="hash-link" aria-label="Direct link to Package Management" title="Direct link to Package Management" translate="no">​</a></h3>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">gh copilot suggest "update docusaurus to latest version"</span><br></div></code></pre></div></div>
<hr>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="development">Development<a href="https://www.recodehive.com/blog/github-cli-agent#development" class="hash-link" aria-label="Direct link to Development" title="Direct link to Development" translate="no">​</a></h3>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">gh copilot suggest "start development server for docusaurus"</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">gh copilot suggest "build docusaurus site for production"</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">gh copilot suggest "deploy docusaurus site"</span><br></div></code></pre></div></div>
<h1>SEO and metadata</h1>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">gh copilot suggest "optimize SEO for docusaurus website"</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">gh copilot suggest "add metadata to blog posts"</span><br></div></code></pre></div></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-resources">🔗 Resources<a href="https://www.recodehive.com/blog/github-cli-agent#-resources" class="hash-link" aria-label="Direct link to 🔗 Resources" title="Direct link to 🔗 Resources" translate="no">​</a></h2>
<ul>
<li class=""><a href="https://docs.github.com/en/copilot/how-tos/use-copilot-agents/use-copilot-cli" target="_blank" rel="noopener noreferrer" class="">Official Documentation</a></li>
<li class=""><a href="https://github.com/github/copilot-cli" target="_blank" rel="noopener noreferrer" class="">Copilot CLI GitHub Repository</a></li>
<li class=""><a href="https://github.com/features/copilot/cli" target="_blank" rel="noopener noreferrer" class="">Copilot Features</a></li>
</ul>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-final-verdict">✅ Final Verdict<a href="https://www.recodehive.com/blog/github-cli-agent#-final-verdict" class="hash-link" aria-label="Direct link to ✅ Final Verdict" title="Direct link to ✅ Final Verdict" translate="no">​</a></h2>
<p><em>GitHub Copilot CLI is the next step in developer productivity bringing AI assistance natively to your terminal. With support for repositories, workflows, testing, and documentation, it simplifies development without taking control away from you.</em></p>
<p>Less setup, more shipping.</p>
<hr>
<div></div>]]></content:encoded>
            <author>sanjay@recodehive.com (Sanjay Viswanthan)</author>
            <category>GitHub</category>
            <category>CLI</category>
            <category>tech</category>
            <category>updates</category>
            <category>Copilot</category>
            <category>Coding</category>
            <category>Assistant</category>
        </item>
        <item>
            <title><![CDATA[N8N: The Future of Workflow Automation]]></title>
            <link>https://www.recodehive.com/blog/n8n-workflow-automation</link>
            <guid>https://www.recodehive.com/blog/n8n-workflow-automation</guid>
            <pubDate>Wed, 17 Sep 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[N8N revolutionizes automation by integrating AI capabilities into visual workflows. Learn how to build intelligent automation pipelines that can process data, make decisions, and interact with multiple services seamlessly.]]></description>
            <content:encoded><![CDATA[<p> 
Hey automation enthusiasts! 🤖</p>
<p>I still remember the moment when I first connected OpenAI's GPT to a Google Sheets workflow in N8N. What started as a simple data processing task suddenly became an intelligent system that could analyze customer feedback, categorize it by sentiment, and automatically generate personalized responses. It was like watching automation evolve from basic "if-this-then-that" logic to something that could actually think.</p>
<p>Today, I want to take you through the fascinating world of N8N AI workflows - how they work, why they're game-changing, and how you can build your own intelligent automation systems that would have seemed like magic just a few years ago.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-is-n8n-ai-automation">What is N8N AI Automation?<a href="https://www.recodehive.com/blog/n8n-workflow-automation#what-is-n8n-ai-automation" class="hash-link" aria-label="Direct link to What is N8N AI Automation?" title="Direct link to What is N8N AI Automation?" translate="no">​</a></h2>
<p><a href="https://n8n.io/" target="_blank" rel="noopener noreferrer">N8N (pronounced "n-eight-n")</a>
is a powerful workflow automation tool that's taken the integration world by storm. But when you add AI capabilities into the mix, something beautiful happens - your workflows stop being simple data pipelines and start becoming intelligent decision-making systems.</p>
<p>Think of traditional automation as a skilled assembly line worker: fast, reliable, but limited to predefined tasks. N8N AI workflows are more like having a smart assistant who can read, understand, analyze, and make contextual decisions while still maintaining the speed and reliability of automation.</p>
<p>The magic lies in combining N8N's visual workflow builder with AI services like OpenAI, Google's AI Platform, or even custom machine learning models to create workflows that can:</p>
<ul>
<li class="">Understand natural language</li>
<li class="">Make complex decisions based on context</li>
<li class="">Generate human-like responses</li>
<li class="">Analyze patterns in data</li>
<li class="">Adapt to new situations</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-architecture-visual-workflows-meet-ai-intelligence">The Architecture: Visual Workflows Meet AI Intelligence<a href="https://www.recodehive.com/blog/n8n-workflow-automation#the-architecture-visual-workflows-meet-ai-intelligence" class="hash-link" aria-label="Direct link to The Architecture: Visual Workflows Meet AI Intelligence" title="Direct link to The Architecture: Visual Workflows Meet AI Intelligence" translate="no">​</a></h2>
<p><img decoding="async" loading="lazy" alt="N8N AI Workflow Architecture" src="https://www.recodehive.com/assets/images/n8n-architecture-example-1ae2940658e4cd90d9f6d98054be2b5d.png" width="1100" height="500" class="img_ev3q"></p>
<p>When you look at an N8N AI workflow, you're seeing a visual representation of an intelligent automation pipeline. Let's break down the key components:</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="1-trigger-nodes-the-starting-point">1. Trigger Nodes: The Starting Point<a href="https://www.recodehive.com/blog/n8n-workflow-automation#1-trigger-nodes-the-starting-point" class="hash-link" aria-label="Direct link to 1. Trigger Nodes: The Starting Point" title="Direct link to 1. Trigger Nodes: The Starting Point" translate="no">​</a></h3>
<p>Every N8N workflow begins with a trigger - the event that sets everything in motion:</p>
<p><strong>Webhook Triggers:</strong></p>
<ul>
<li class="">HTTP requests from external applications</li>
<li class="">Perfect for real-time integrations</li>
<li class="">Can receive data from forms, apps, or other systems</li>
</ul>
<p><strong>Schedule Triggers:</strong></p>
<ul>
<li class="">Time-based automation (cron jobs made visual)</li>
<li class="">Great for periodic data processing</li>
<li class="">Can run daily reports, weekly summaries, etc.</li>
</ul>
<p><strong>App Triggers:</strong></p>
<ul>
<li class="">Direct integration with services (Gmail, Slack, Salesforce)</li>
<li class="">Event-driven automation (new email, message, record created)</li>
<li class="">Real-time responsiveness to external changes</li>
</ul>
<p><strong>Manual Triggers:</strong></p>
<ul>
<li class="">On-demand execution</li>
<li class="">Perfect for testing and ad-hoc processing</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="2-data-processing-nodes-the-workhorses">2. Data Processing Nodes: The Workhorses<a href="https://www.recodehive.com/blog/n8n-workflow-automation#2-data-processing-nodes-the-workhorses" class="hash-link" aria-label="Direct link to 2. Data Processing Nodes: The Workhorses" title="Direct link to 2. Data Processing Nodes: The Workhorses" translate="no">​</a></h3>
<p>These nodes handle the heavy lifting of data transformation and routing:</p>
<p><strong>HTTP Request Nodes:</strong></p>
<ul>
<li class="">Connect to any REST API</li>
<li class="">Fetch data from external services</li>
<li class="">Send processed results to other systems</li>
</ul>
<p><strong>Function Nodes:</strong></p>
<ul>
<li class="">Custom JavaScript execution</li>
<li class="">Complex data manipulation</li>
<li class="">Custom business logic implementation</li>
</ul>
<p><strong>Conditional Logic Nodes:</strong></p>
<ul>
<li class="">IF/THEN/ELSE branching</li>
<li class="">Route data based on conditions</li>
<li class="">Create intelligent decision trees</li>
</ul>
<p><strong>Data Transformation Nodes:</strong></p>
<ul>
<li class="">Filter, sort, and reshape data</li>
<li class="">Extract specific fields</li>
<li class="">Combine data from multiple sources</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="3-ai-integration-nodes-the-intelligence-layer">3. AI Integration Nodes: The Intelligence Layer<a href="https://www.recodehive.com/blog/n8n-workflow-automation#3-ai-integration-nodes-the-intelligence-layer" class="hash-link" aria-label="Direct link to 3. AI Integration Nodes: The Intelligence Layer" title="Direct link to 3. AI Integration Nodes: The Intelligence Layer" translate="no">​</a></h3>
<p>This is where the magic happens - nodes that bring artificial intelligence into your workflows:</p>
<p><strong>OpenAI Nodes:</strong></p>
<ul>
<li class="">GPT for text generation and analysis</li>
<li class="">DALL-E for image generation</li>
<li class="">Embeddings for semantic search</li>
<li class="">Fine-tuned models for specific tasks</li>
</ul>
<p><strong>Google AI Nodes:</strong></p>
<ul>
<li class="">Natural Language Processing</li>
<li class="">Translation services</li>
<li class="">Vision AI for image analysis</li>
<li class="">AutoML integration</li>
</ul>
<p><strong>Anthropic Claude Nodes:</strong></p>
<ul>
<li class="">Advanced reasoning and analysis</li>
<li class="">Long-form content generation</li>
<li class="">Code assistance and review</li>
</ul>
<p><strong>Custom AI Model Nodes:</strong></p>
<ul>
<li class="">Integration with your own ML models</li>
<li class="">TensorFlow and PyTorch model serving</li>
<li class="">Edge AI deployment</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="4-output-nodes-the-final-destination">4. Output Nodes: The Final Destination<a href="https://www.recodehive.com/blog/n8n-workflow-automation#4-output-nodes-the-final-destination" class="hash-link" aria-label="Direct link to 4. Output Nodes: The Final Destination" title="Direct link to 4. Output Nodes: The Final Destination" translate="no">​</a></h3>
<p>Where your processed, AI-enhanced data ends up:</p>
<p><strong>Database Nodes:</strong></p>
<ul>
<li class="">Store results in PostgreSQL, MySQL, MongoDB</li>
<li class="">Build intelligent data lakes</li>
<li class="">Create audit trails</li>
</ul>
<p><strong>Notification Nodes:</strong></p>
<ul>
<li class="">Send Slack messages, emails, SMS</li>
<li class="">Create intelligent alerting systems</li>
<li class="">Deliver personalized communications</li>
</ul>
<p><strong>File System Nodes:</strong></p>
<ul>
<li class="">Generate reports, documents, images</li>
<li class="">Store processed data files</li>
<li class="">Create automated deliverables</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="how-ai-transforms-traditional-workflows">How AI Transforms Traditional Workflows<a href="https://www.recodehive.com/blog/n8n-workflow-automation#how-ai-transforms-traditional-workflows" class="hash-link" aria-label="Direct link to How AI Transforms Traditional Workflows" title="Direct link to How AI Transforms Traditional Workflows" translate="no">​</a></h2>
<p>Let me show you the difference between traditional automation and AI-powered workflows with a real example:</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="traditional-workflow-simple-customer-support-ticket-routing">Traditional Workflow: Simple Customer Support Ticket Routing<a href="https://www.recodehive.com/blog/n8n-workflow-automation#traditional-workflow-simple-customer-support-ticket-routing" class="hash-link" aria-label="Direct link to Traditional Workflow: Simple Customer Support Ticket Routing" title="Direct link to Traditional Workflow: Simple Customer Support Ticket Routing" translate="no">​</a></h3>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">New Email → Extract Sender → Check Department → Forward to Team → Done</span><br></div></code></pre></div></div>
<p>This works, but it's rigid. What if the email is about multiple departments? What if the subject line is unclear?</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="ai-enhanced-workflow-intelligent-customer-support">AI-Enhanced Workflow: Intelligent Customer Support<a href="https://www.recodehive.com/blog/n8n-workflow-automation#ai-enhanced-workflow-intelligent-customer-support" class="hash-link" aria-label="Direct link to AI-Enhanced Workflow: Intelligent Customer Support" title="Direct link to AI-Enhanced Workflow: Intelligent Customer Support" translate="no">​</a></h3>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">New Email → AI Analysis (Extract Intent, Sentiment, Urgency) → </span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">Smart Routing (Consider Context, History, Workload) → </span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">Generate Response Draft → Human Review → Send Personalized Response</span><br></div></code></pre></div></div>
<p>The AI version can:</p>
<ul>
<li class="">Understand the actual meaning behind customer messages</li>
<li class="">Consider emotional context (frustrated vs. curious customers)</li>
<li class="">Route based on content, not just keywords</li>
<li class="">Generate contextual response drafts</li>
<li class="">Learn from previous interactions</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="core-ai-workflow-patterns">Core AI Workflow Patterns<a href="https://www.recodehive.com/blog/n8n-workflow-automation#core-ai-workflow-patterns" class="hash-link" aria-label="Direct link to Core AI Workflow Patterns" title="Direct link to Core AI Workflow Patterns" translate="no">​</a></h2>
<p>After building dozens of AI workflows, I've identified several powerful patterns that you can adapt for almost any use case:</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="1-the-content-intelligence-pipeline">1. The Content Intelligence Pipeline<a href="https://www.recodehive.com/blog/n8n-workflow-automation#1-the-content-intelligence-pipeline" class="hash-link" aria-label="Direct link to 1. The Content Intelligence Pipeline" title="Direct link to 1. The Content Intelligence Pipeline" translate="no">​</a></h3>
<p><strong>Use Case:</strong> Automatically process and categorize incoming content</p>
<p><strong>Flow Structure:</strong></p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">Content Trigger → AI Content Analysis → Categorization → </span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">Sentiment Analysis → Keyword Extraction → Storage + Routing</span><br></div></code></pre></div></div>
<p><strong>Real-World Applications:</strong></p>
<ul>
<li class="">Social media monitoring and response</li>
<li class="">Customer feedback processing</li>
<li class="">Content moderation and filtering</li>
<li class="">News article categorization</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="2-the-decision-intelligence-framework">2. The Decision Intelligence Framework<a href="https://www.recodehive.com/blog/n8n-workflow-automation#2-the-decision-intelligence-framework" class="hash-link" aria-label="Direct link to 2. The Decision Intelligence Framework" title="Direct link to 2. The Decision Intelligence Framework" translate="no">​</a></h3>
<p><strong>Use Case:</strong> Make complex decisions based on multiple data sources</p>
<p><strong>Flow Structure:</strong></p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">Data Collection → AI Analysis → Risk Assessment → </span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">Decision Matrix → Automated Action + Human Notification</span><br></div></code></pre></div></div>
<p><strong>Real-World Applications:</strong></p>
<ul>
<li class="">Loan approval workflows</li>
<li class="">Inventory restocking decisions</li>
<li class="">Quality control assessment</li>
<li class="">Investment recommendations</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="3-the-communication-intelligence-system">3. The Communication Intelligence System<a href="https://www.recodehive.com/blog/n8n-workflow-automation#3-the-communication-intelligence-system" class="hash-link" aria-label="Direct link to 3. The Communication Intelligence System" title="Direct link to 3. The Communication Intelligence System" translate="no">​</a></h3>
<p><strong>Use Case:</strong> Generate and personalize communications at scale</p>
<p><strong>Flow Structure:</strong></p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">Trigger Event → Context Gathering → AI Content Generation → </span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">Personalization → Multi-Channel Delivery → Response Tracking</span><br></div></code></pre></div></div>
<p><strong>Real-World Applications:</strong></p>
<ul>
<li class="">Personalized marketing campaigns</li>
<li class="">Customer onboarding sequences</li>
<li class="">Support ticket responses</li>
<li class="">Sales follow-up automation</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="4-the-data-intelligence-engine">4. The Data Intelligence Engine<a href="https://www.recodehive.com/blog/n8n-workflow-automation#4-the-data-intelligence-engine" class="hash-link" aria-label="Direct link to 4. The Data Intelligence Engine" title="Direct link to 4. The Data Intelligence Engine" translate="no">​</a></h3>
<p><strong>Use Case:</strong> Extract insights and patterns from large datasets</p>
<p><strong>Flow Structure:</strong></p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">Data Ingestion → AI Analysis → Pattern Recognition → </span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">Insight Generation → Visualization → Action Recommendations</span><br></div></code></pre></div></div>
<p><strong>Real-World Applications:</strong></p>
<ul>
<li class="">Sales trend analysis</li>
<li class="">Customer behavior prediction</li>
<li class="">Operational efficiency optimization</li>
<li class="">Risk pattern detection</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="real-world-use-cases-and-success-stories">Real-World Use Cases and Success Stories<a href="https://www.recodehive.com/blog/n8n-workflow-automation#real-world-use-cases-and-success-stories" class="hash-link" aria-label="Direct link to Real-World Use Cases and Success Stories" title="Direct link to Real-World Use Cases and Success Stories" translate="no">​</a></h2>
<p>Here are some powerful AI workflows I've seen in production:</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="1-e-commerce-intelligence-platform">1. E-commerce Intelligence Platform<a href="https://www.recodehive.com/blog/n8n-workflow-automation#1-e-commerce-intelligence-platform" class="hash-link" aria-label="Direct link to 1. E-commerce Intelligence Platform" title="Direct link to 1. E-commerce Intelligence Platform" translate="no">​</a></h3>
<p><strong>Challenge:</strong> Online store receiving thousands of product reviews daily
<strong>Solution:</strong> AI workflow that analyzes reviews, extracts insights, and automatically updates product descriptions</p>
<p><strong>Results:</strong></p>
<ul>
<li class="">95% reduction in manual review processing time</li>
<li class="">40% improvement in product page conversion rates</li>
<li class="">Automatic identification of product issues before they become major problems</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="2-hr-recruitment-automation">2. HR Recruitment Automation<a href="https://www.recodehive.com/blog/n8n-workflow-automation#2-hr-recruitment-automation" class="hash-link" aria-label="Direct link to 2. HR Recruitment Automation" title="Direct link to 2. HR Recruitment Automation" translate="no">​</a></h3>
<p><strong>Challenge:</strong> Screening hundreds of resumes for multiple positions
<strong>Solution:</strong> AI workflow that analyzes resumes, matches them to job requirements, and generates personalized outreach</p>
<p><strong>Results:</strong></p>
<ul>
<li class="">80% reduction in initial screening time</li>
<li class="">60% improvement in candidate-job fit quality</li>
<li class="">Personalized communication that increased response rates by 45%</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="3-financial-risk-assessment">3. Financial Risk Assessment<a href="https://www.recodehive.com/blog/n8n-workflow-automation#3-financial-risk-assessment" class="hash-link" aria-label="Direct link to 3. Financial Risk Assessment" title="Direct link to 3. Financial Risk Assessment" translate="no">​</a></h3>
<p><strong>Challenge:</strong> Manually reviewing loan applications across multiple criteria
<strong>Solution:</strong> AI workflow that combines financial data analysis with behavioral pattern recognition</p>
<p><strong>Results:</strong></p>
<ul>
<li class="">70% faster decision-making process</li>
<li class="">25% improvement in risk prediction accuracy</li>
<li class="">Consistent evaluation criteria across all applications</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="4-content-marketing-automation">4. Content Marketing Automation<a href="https://www.recodehive.com/blog/n8n-workflow-automation#4-content-marketing-automation" class="hash-link" aria-label="Direct link to 4. Content Marketing Automation" title="Direct link to 4. Content Marketing Automation" translate="no">​</a></h3>
<p><strong>Challenge:</strong> Creating personalized content for different audience segments
<strong>Solution:</strong> AI workflow that analyzes audience data and generates tailored content automatically</p>
<p><strong>Results:</strong></p>
<ul>
<li class="">10x increase in content production capacity</li>
<li class="">35% improvement in engagement rates</li>
<li class="">Consistent brand voice across all generated content</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-integration-ecosystem-n8ns-superpower">The Integration Ecosystem: N8N's Superpower<a href="https://www.recodehive.com/blog/n8n-workflow-automation#the-integration-ecosystem-n8ns-superpower" class="hash-link" aria-label="Direct link to The Integration Ecosystem: N8N's Superpower" title="Direct link to The Integration Ecosystem: N8N's Superpower" translate="no">​</a></h2>
<p>What makes N8N AI workflows truly powerful is the vast ecosystem of integrations available:</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="popular-service-integrations">Popular Service Integrations:<a href="https://www.recodehive.com/blog/n8n-workflow-automation#popular-service-integrations" class="hash-link" aria-label="Direct link to Popular Service Integrations:" title="Direct link to Popular Service Integrations:" translate="no">​</a></h3>
<p><strong>Communication Platforms:</strong></p>
<ul>
<li class="">Slack, Discord, Microsoft Teams</li>
<li class="">Email (Gmail, Outlook, SendGrid)</li>
<li class="">SMS (Twilio, Amazon SNS)</li>
</ul>
<p><strong>Data Stores:</strong></p>
<ul>
<li class="">Google Sheets, Airtable</li>
<li class="">Databases (PostgreSQL, MySQL, MongoDB)</li>
<li class="">Cloud Storage (Google Drive, Dropbox, AWS S3)</li>
</ul>
<p><strong>Business Applications:</strong></p>
<ul>
<li class="">CRM (Salesforce, HubSpot, Pipedrive)</li>
<li class="">Project Management (Notion, Asana, Jira)</li>
<li class="">E-commerce (Shopify, WooCommerce)</li>
</ul>
<p><strong>AI and ML Services:</strong></p>
<ul>
<li class="">OpenAI (GPT, DALL-E, Whisper)</li>
<li class="">Google AI (Vision, Language, Translation)</li>
<li class="">AWS AI (Comprehend, Rekognition, Textract)</li>
<li class="">Custom ML models via API</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="creating-intelligent-integration-chains">Creating Intelligent Integration Chains:<a href="https://www.recodehive.com/blog/n8n-workflow-automation#creating-intelligent-integration-chains" class="hash-link" aria-label="Direct link to Creating Intelligent Integration Chains:" title="Direct link to Creating Intelligent Integration Chains:" translate="no">​</a></h3>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">Salesforce Lead → AI Qualification → Google Sheets Update → </span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">Slack Notification → Email Sequence → Calendar Booking → </span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">Follow-up Automation</span><br></div></code></pre></div></div>
<p>Each step can be enhanced with AI intelligence, creating a seamless experience that feels magical to end users.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="future-trends-where-ai-workflows-are-heading">Future Trends: Where AI Workflows Are Heading<a href="https://www.recodehive.com/blog/n8n-workflow-automation#future-trends-where-ai-workflows-are-heading" class="hash-link" aria-label="Direct link to Future Trends: Where AI Workflows Are Heading" title="Direct link to Future Trends: Where AI Workflows Are Heading" translate="no">​</a></h2>
<p>The world of AI automation is evolving rapidly. Here are the trends I'm watching:</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="1-multi-modal-ai-integration">1. Multi-Modal AI Integration<a href="https://www.recodehive.com/blog/n8n-workflow-automation#1-multi-modal-ai-integration" class="hash-link" aria-label="Direct link to 1. Multi-Modal AI Integration" title="Direct link to 1. Multi-Modal AI Integration" translate="no">​</a></h3>
<p>Workflows that can process text, images, audio, and video in the same pipeline:</p>
<div class="language-text codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-text codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">Voice Input → Speech-to-Text → Intent Analysis → </span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">Image Processing → Decision Making → Multi-Format Response</span><br></div></code></pre></div></div>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="2-autonomous-workflow-optimization">2. Autonomous Workflow Optimization<a href="https://www.recodehive.com/blog/n8n-workflow-automation#2-autonomous-workflow-optimization" class="hash-link" aria-label="Direct link to 2. Autonomous Workflow Optimization" title="Direct link to 2. Autonomous Workflow Optimization" translate="no">​</a></h3>
<p>AI systems that can optimize their own workflows:</p>
<ul>
<li class="">Automatically adjust parameters based on performance</li>
<li class="">Suggest new integration opportunities</li>
<li class="">Identify bottlenecks and propose solutions</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="3-collaborative-ai-workflows">3. Collaborative AI Workflows<a href="https://www.recodehive.com/blog/n8n-workflow-automation#3-collaborative-ai-workflows" class="hash-link" aria-label="Direct link to 3. Collaborative AI Workflows" title="Direct link to 3. Collaborative AI Workflows" translate="no">​</a></h3>
<p>Multiple AI agents working together within a single workflow:</p>
<ul>
<li class="">Specialist AIs for different domains</li>
<li class="">Consensus-building among AI models</li>
<li class="">Dynamic role assignment based on task requirements</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="4-edge-ai-integration">4. Edge AI Integration<a href="https://www.recodehive.com/blog/n8n-workflow-automation#4-edge-ai-integration" class="hash-link" aria-label="Direct link to 4. Edge AI Integration" title="Direct link to 4. Edge AI Integration" translate="no">​</a></h3>
<p>Running AI models directly within N8N workflows:</p>
<ul>
<li class="">Reduced latency and costs</li>
<li class="">Enhanced privacy and data security</li>
<li class="">Offline operation capabilities</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="getting-started-your-ai-workflow-journey">Getting Started: Your AI Workflow Journey<a href="https://www.recodehive.com/blog/n8n-workflow-automation#getting-started-your-ai-workflow-journey" class="hash-link" aria-label="Direct link to Getting Started: Your AI Workflow Journey" title="Direct link to Getting Started: Your AI Workflow Journey" translate="no">​</a></h2>
<p>Ready to build your first AI workflow? Here's your roadmap:</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="phase-1-foundation-building-week-1-2">Phase 1: Foundation Building (Week 1-2)<a href="https://www.recodehive.com/blog/n8n-workflow-automation#phase-1-foundation-building-week-1-2" class="hash-link" aria-label="Direct link to Phase 1: Foundation Building (Week 1-2)" title="Direct link to Phase 1: Foundation Building (Week 1-2)" translate="no">​</a></h3>
<ol>
<li class="">Set up N8N (self-hosted or cloud)</li>
<li class="">Create your first simple workflow without AI</li>
<li class="">Learn the basic nodes and flow patterns</li>
<li class="">Connect to your most-used services</li>
</ol>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="phase-2-ai-integration-week-3-4">Phase 2: AI Integration (Week 3-4)<a href="https://www.recodehive.com/blog/n8n-workflow-automation#phase-2-ai-integration-week-3-4" class="hash-link" aria-label="Direct link to Phase 2: AI Integration (Week 3-4)" title="Direct link to Phase 2: AI Integration (Week 3-4)" translate="no">​</a></h3>
<ol>
<li class="">Add your first AI node (start with OpenAI)</li>
<li class="">Build a simple text analysis workflow</li>
<li class="">Experiment with different prompts and parameters</li>
<li class="">Learn prompt engineering basics</li>
</ol>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="phase-3-advanced-patterns-month-2">Phase 3: Advanced Patterns (Month 2)<a href="https://www.recodehive.com/blog/n8n-workflow-automation#phase-3-advanced-patterns-month-2" class="hash-link" aria-label="Direct link to Phase 3: Advanced Patterns (Month 2)" title="Direct link to Phase 3: Advanced Patterns (Month 2)" translate="no">​</a></h3>
<ol>
<li class="">Implement conditional logic based on AI results</li>
<li class="">Create multi-step AI processing workflows</li>
<li class="">Add error handling and fallback logic</li>
<li class="">Optimize for performance and cost</li>
</ol>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="phase-4-production-deployment-month-3">Phase 4: Production Deployment (Month 3)<a href="https://www.recodehive.com/blog/n8n-workflow-automation#phase-4-production-deployment-month-3" class="hash-link" aria-label="Direct link to Phase 4: Production Deployment (Month 3)" title="Direct link to Phase 4: Production Deployment (Month 3)" translate="no">​</a></h3>
<ol>
<li class="">Monitor and log workflow performance</li>
<li class="">Implement proper security measures</li>
<li class="">Create comprehensive documentation</li>
<li class="">Train your team on workflow management</li>
</ol>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="resources-to-accelerate-your-learning">Resources to Accelerate Your Learning:<a href="https://www.recodehive.com/blog/n8n-workflow-automation#resources-to-accelerate-your-learning" class="hash-link" aria-label="Direct link to Resources to Accelerate Your Learning:" title="Direct link to Resources to Accelerate Your Learning:" translate="no">​</a></h3>
<p><strong>Documentation and Tutorials:</strong></p>
<ul>
<li class="">N8N official documentation and community forum</li>
<li class="">AI service provider documentation (OpenAI, Google AI, etc.)</li>
<li class="">Workflow template galleries and examples</li>
</ul>
<p><strong>Community and Support:</strong></p>
<ul>
<li class="">N8N Discord community</li>
<li class="">GitHub repositories with example workflows</li>
<li class="">YouTube tutorials and case studies</li>
</ul>
<p><strong>Best Practice Guides:</strong></p>
<ul>
<li class="">Security considerations for API keys and sensitive data</li>
<li class="">Performance optimization techniques</li>
<li class="">Cost management strategies</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="conclusion-the-future-is-intelligent-automation">Conclusion: The Future is Intelligent Automation<a href="https://www.recodehive.com/blog/n8n-workflow-automation#conclusion-the-future-is-intelligent-automation" class="hash-link" aria-label="Direct link to Conclusion: The Future is Intelligent Automation" title="Direct link to Conclusion: The Future is Intelligent Automation" translate="no">​</a></h2>
<p>AI workflows in N8N represent a fundamental shift in how we think about automation. We're moving from rigid, rule-based systems to intelligent, adaptive processes that can understand context, make decisions, and learn from experience.</p>
<p>The beauty of this technology lies not just in its technical capabilities, but in how it democratizes artificial intelligence. You don't need to be a data scientist or machine learning engineer to build sophisticated AI systems. With N8N's visual interface and the growing ecosystem of AI services, anyone can create intelligent automation that would have required a team of specialists just a few years ago.</p>
<p>Whether you're automating customer service, processing business data, generating content, or solving domain-specific challenges, AI workflows give you the power to build systems that are not just fast and reliable, but genuinely intelligent.</p>
<p>The future belongs to organizations that can seamlessly blend human creativity with artificial intelligence, and N8N AI workflows are the bridge that makes this possible. So start small, experiment freely, and prepare to be amazed by what you can build when you combine the power of automation with the intelligence of AI.</p>
<hr>
<p><em>The next time someone asks you about the future of automation, show them an N8N AI workflow in action. Watch their expression change from skepticism to wonder as they realize we're not just talking about the future anymore - we're living in it. Happy automating!</em></p>
<div></div>]]></content:encoded>
            <author>rathoreadityasingh30@gmail.com (Aditya Singh Rathore)</author>
            <category>N8N</category>
            <category>AI Automation</category>
            <category>Workflow Automation</category>
            <category>No-Code</category>
            <category>Integration</category>
            <category>Machine Learning</category>
            <category>API Integration</category>
        </item>
        <item>
            <title><![CDATA[Spark Architecture Explained]]></title>
            <link>https://www.recodehive.com/blog/spark-architecture</link>
            <guid>https://www.recodehive.com/blog/spark-architecture</guid>
            <pubDate>Fri, 22 Aug 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[Apache Spark is a fast, open-source big data framework that leverages in-memory computing for high performance. Its architecture powers scalable distributed processing across clusters, making it essential for analytics and machine learning.]]></description>
            <content:encoded><![CDATA[<p> </p>
<p>Hey there, fellow data enthusiasts! 👋</p>
<p>I remember the first time I encountered a Spark architecture diagram. It looked like a complex web of boxes and arrows that seemed to communicate in some secret distributed computing language. But once I understood what each component actually does and how they work together, everything clicked into place.</p>
<p>Today, I want to walk you through Spark's architecture in a way that I wish someone had explained it to me back then - focusing on the core components and how this beautiful system actually works under the hood.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-is-apache-spark">What is Apache Spark?<a href="https://www.recodehive.com/blog/spark-architecture#what-is-apache-spark" class="hash-link" aria-label="Direct link to What is Apache Spark?" title="Direct link to What is Apache Spark?" translate="no">​</a></h2>
<p>Before diving into the architecture, let's establish what we're dealing with. Apache Spark is an open-source, distributed computing framework designed to process massive datasets across clusters of computers. Think of it as a coordinator that can take your data processing job and intelligently distribute it across multiple machines to get the work done faster.</p>
<p>The key insight that makes Spark special? It keeps data in memory between operations whenever possible, which is why it can be dramatically faster than traditional batch processing systems.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-big-picture-high-level-architecture">The Big Picture: High-Level Architecture<a href="https://www.recodehive.com/blog/spark-architecture#the-big-picture-high-level-architecture" class="hash-link" aria-label="Direct link to The Big Picture: High-Level Architecture" title="Direct link to The Big Picture: High-Level Architecture" translate="no">​</a></h2>
<p><img decoding="async" loading="lazy" alt="Spark Architecture" src="https://www.recodehive.com/assets/images/07-spark_architecture-e73d0350f6f913d028c171532a18cc2a.png" width="596" height="286" class="img_ev3q"></p>
<p>When you look at Spark's architecture, you're essentially looking at a well-orchestrated system with three main types of components working together:</p>
<ol>
<li class=""><strong>Driver Program</strong> - The mastermind that coordinates everything</li>
<li class=""><strong>Cluster Manager</strong> - The resource allocator</li>
<li class=""><strong>Executors</strong> - The workers that do the actual processing</li>
</ol>
<p>Let's break down each of these and understand how they collaborate.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="core-components-deep-dive">Core Components Deep Dive<a href="https://www.recodehive.com/blog/spark-architecture#core-components-deep-dive" class="hash-link" aria-label="Direct link to Core Components Deep Dive" title="Direct link to Core Components Deep Dive" translate="no">​</a></h2>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="1-the-driver-program-your-applications-brain">1. The Driver Program: Your Application's Brain<a href="https://www.recodehive.com/blog/spark-architecture#1-the-driver-program-your-applications-brain" class="hash-link" aria-label="Direct link to 1. The Driver Program: Your Application's Brain" title="Direct link to 1. The Driver Program: Your Application's Brain" translate="no">​</a></h3>
<p>The Driver Program is where your Spark application begins and ends. When you write a Spark program and run it, you're essentially creating a driver program. Here's what makes it the brain of the operation:</p>
<p><strong>What the Driver Does:</strong></p>
<ul>
<li class="">Contains your main() function and defines RDDs(Resilient Distributed Datasets) and operations on them</li>
<li class="">Converts your high-level operations into a DAG (Directed Acyclic Graph) of tasks</li>
<li class="">Schedules tasks across the cluster</li>
<li class="">Coordinates with the cluster manager to get resources</li>
<li class="">Collects results from executors and returns final results</li>
</ul>
<p><strong>Think of it this way:</strong> If your Spark application were a restaurant, the Driver would be the head chef who takes orders (your code), breaks them down into specific cooking tasks, assigns those tasks to kitchen staff (executors), and ensures everything comes together for the final dish.</p>
<p>The driver runs in its own JVM(Java Virtual Machine) process and maintains all the metadata about your Spark application throughout its lifetime.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="2-cluster-manager-the-resource-referee">2. Cluster Manager: The Resource Referee<a href="https://www.recodehive.com/blog/spark-architecture#2-cluster-manager-the-resource-referee" class="hash-link" aria-label="Direct link to 2. Cluster Manager: The Resource Referee" title="Direct link to 2. Cluster Manager: The Resource Referee" translate="no">​</a></h3>
<p>The Cluster Manager sits between your driver and the actual compute resources. Its job is to allocate and manage resources across the cluster. Spark is flexible and works with several cluster managers:</p>
<p><strong>Standalone Cluster Manager:</strong></p>
<ul>
<li class="">Spark's built-in cluster manager</li>
<li class="">Simple to set up and understand</li>
<li class="">Great for dedicated Spark clusters</li>
</ul>
<p><strong>Apache YARN (Yet Another Resource Negotiator):</strong></p>
<ul>
<li class="">Hadoop's resource manager</li>
<li class="">Perfect if you're in a Hadoop ecosystem</li>
<li class="">Allows resource sharing between Spark and other Hadoop applications</li>
</ul>
<p><strong>Apache Mesos:</strong></p>
<ul>
<li class="">A general-purpose cluster manager</li>
<li class="">Can handle multiple frameworks beyond just Spark</li>
<li class="">Good for mixed workload environments</li>
</ul>
<p><strong>Kubernetes:</strong></p>
<ul>
<li class="">The modern container orchestration platform</li>
<li class="">Increasingly popular for new deployments</li>
<li class="">Excellent for cloud-native environments</li>
</ul>
<p><strong>The key point:</strong> The cluster manager's job is resource allocation - it doesn't care what your application does, just how much CPU and memory it needs.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="3-executors-the-workhorses">3. Executors: The Workhorses<a href="https://www.recodehive.com/blog/spark-architecture#3-executors-the-workhorses" class="hash-link" aria-label="Direct link to 3. Executors: The Workhorses" title="Direct link to 3. Executors: The Workhorses" translate="no">​</a></h3>
<p>Executors are the processes that actually run your tasks and store data for your application. Each executor runs in its own JVM process and can run multiple tasks concurrently using threads.</p>
<p><strong>What Executors Do:</strong></p>
<ul>
<li class="">Execute tasks sent from the driver</li>
<li class="">Store computation results in memory or disk storage</li>
<li class="">Provide in-memory storage for cached RDDs/DataFrames</li>
<li class="">Report heartbeat and task status back to the driver</li>
</ul>
<p><strong>Key Characteristics:</strong></p>
<ul>
<li class="">Each executor has a fixed number of cores and amount of memory</li>
<li class="">Executors are launched at the start of a Spark application and run for the entire lifetime</li>
<li class="">If an executor fails, Spark can launch new ones and recompute lost data</li>
</ul>
<p>Think of executors as skilled workers in our restaurant analogy - they can handle multiple cooking tasks simultaneously and have their own workspace (memory) to store ingredients and intermediate results.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="how-these-components-work-together-the-execution-flow">How These Components Work Together: The Execution Flow<a href="https://www.recodehive.com/blog/spark-architecture#how-these-components-work-together-the-execution-flow" class="hash-link" aria-label="Direct link to How These Components Work Together: The Execution Flow" title="Direct link to How These Components Work Together: The Execution Flow" translate="no">​</a></h2>
<p>Now that we know the players, let's see how they orchestrate a typical Spark application:</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="step-1-application-submission">Step 1: Application Submission<a href="https://www.recodehive.com/blog/spark-architecture#step-1-application-submission" class="hash-link" aria-label="Direct link to Step 1: Application Submission" title="Direct link to Step 1: Application Submission" translate="no">​</a></h3>
<p>When you submit a Spark application, the driver program starts up and contacts the cluster manager requesting resources for executors.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="step-2-resource-allocation">Step 2: Resource Allocation<a href="https://www.recodehive.com/blog/spark-architecture#step-2-resource-allocation" class="hash-link" aria-label="Direct link to Step 2: Resource Allocation" title="Direct link to Step 2: Resource Allocation" translate="no">​</a></h3>
<p>The cluster manager examines available resources and launches executor processes on worker nodes across the cluster.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="step-3-task-planning">Step 3: Task Planning<a href="https://www.recodehive.com/blog/spark-architecture#step-3-task-planning" class="hash-link" aria-label="Direct link to Step 3: Task Planning" title="Direct link to Step 3: Task Planning" translate="no">​</a></h3>
<p>The driver analyzes your code and creates a logical execution plan. It breaks down operations into stages and tasks that can be executed in parallel.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="step-4-task-distribution">Step 4: Task Distribution<a href="https://www.recodehive.com/blog/spark-architecture#step-4-task-distribution" class="hash-link" aria-label="Direct link to Step 4: Task Distribution" title="Direct link to Step 4: Task Distribution" translate="no">​</a></h3>
<p>The driver sends tasks to executors. Each task operates on a partition of data, and multiple tasks can run in parallel across different executors.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="step-5-execution-and-communication">Step 5: Execution and Communication<a href="https://www.recodehive.com/blog/spark-architecture#step-5-execution-and-communication" class="hash-link" aria-label="Direct link to Step 5: Execution and Communication" title="Direct link to Step 5: Execution and Communication" translate="no">​</a></h3>
<p>Executors run the tasks, storing intermediate results and communicating progress back to the driver. The driver coordinates everything and handles any failures.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="step-6-result-collection">Step 6: Result Collection<a href="https://www.recodehive.com/blog/spark-architecture#step-6-result-collection" class="hash-link" aria-label="Direct link to Step 6: Result Collection" title="Direct link to Step 6: Result Collection" translate="no">​</a></h3>
<p>Once all tasks complete, the driver collects results and returns the final output to your application.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="understanding-rdds-the-foundation">Understanding RDDs: The Foundation<a href="https://www.recodehive.com/blog/spark-architecture#understanding-rdds-the-foundation" class="hash-link" aria-label="Direct link to Understanding RDDs: The Foundation" title="Direct link to Understanding RDDs: The Foundation" translate="no">​</a></h2>
<p>At the heart of Spark's architecture lies the concept of Resilient Distributed Datasets (RDDs). Understanding RDDs is crucial to understanding how Spark actually works.</p>
<p><strong>What makes RDDs special:</strong></p>
<p><strong>Resilient:</strong> RDDs can automatically recover from node failures. Spark remembers how each RDD was created (its lineage) and can rebuild lost partitions.</p>
<p><strong>Distributed:</strong> RDD data is automatically partitioned and distributed across multiple nodes in the cluster.</p>
<p><strong>Dataset:</strong> At the end of the day, it's still just a collection of your data - but with superpowers.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="rdd-operations-transformations-vs-actions">RDD Operations: Transformations vs Actions<a href="https://www.recodehive.com/blog/spark-architecture#rdd-operations-transformations-vs-actions" class="hash-link" aria-label="Direct link to RDD Operations: Transformations vs Actions" title="Direct link to RDD Operations: Transformations vs Actions" translate="no">​</a></h3>
<p>RDDs support two types of operations, and understanding the difference is crucial:</p>
<p><strong>Transformations</strong> (Lazy):</p>
<div class="language-scala codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-scala codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">val filtered = data.filter(x =&gt; x &gt; 10)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">val mapped = filtered.map(x =&gt; x * 2)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">val grouped = mapped.groupByKey()</span><br></div></code></pre></div></div>
<p>These operations don't actually execute immediately. Spark just builds up a computation graph.</p>
<p><strong>Actions</strong> (Eager):</p>
<div class="language-scala codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-scala codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">val results = grouped.collect()  // Brings data to driver</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">val count = filtered.count()     // Returns number of elements</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">grouped.saveAsTextFile("hdfs://...")  // Saves to storage</span><br></div></code></pre></div></div>
<p>Actions trigger the actual execution of all the transformations in the lineage.</p>
<p>This lazy evaluation allows Spark to optimize the entire computation pipeline before executing anything.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-dag-sparks-optimization-engine">The DAG: Spark's Optimization Engine<a href="https://www.recodehive.com/blog/spark-architecture#the-dag-sparks-optimization-engine" class="hash-link" aria-label="Direct link to The DAG: Spark's Optimization Engine" title="Direct link to The DAG: Spark's Optimization Engine" translate="no">​</a></h2>
<p>One of Spark's most elegant features is how it converts your operations into a Directed Acyclic Graph (DAG) for optimal execution.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="how-dag-optimization-works">How DAG Optimization Works<a href="https://www.recodehive.com/blog/spark-architecture#how-dag-optimization-works" class="hash-link" aria-label="Direct link to How DAG Optimization Works" title="Direct link to How DAG Optimization Works" translate="no">​</a></h3>
<p>When you chain multiple transformations together, Spark doesn't execute them immediately. Instead, it builds a DAG that represents the computation. This allows for powerful optimizations:</p>
<p><strong>Pipelining:</strong> Multiple transformations that don't require data shuffling can be combined into a single stage and executed together.</p>
<p><strong>Stage Boundaries:</strong> Spark creates stage boundaries at operations that require data shuffling (like <code>groupByKey</code>, <code>join</code>, or <code>repartition</code>).</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="stages-and-tasks-breakdown">Stages and Tasks Breakdown<a href="https://www.recodehive.com/blog/spark-architecture#stages-and-tasks-breakdown" class="hash-link" aria-label="Direct link to Stages and Tasks Breakdown" title="Direct link to Stages and Tasks Breakdown" translate="no">​</a></h3>
<p><strong>Stage:</strong> A set of tasks that can all be executed without data shuffling. All tasks in a stage can run in parallel.</p>
<p><strong>Task:</strong> The smallest unit of work in Spark. Each task processes one partition of data.</p>
<p><strong>Wide vs Narrow Dependencies:</strong></p>
<ul>
<li class=""><strong>Narrow Dependencies:</strong> Each partition of child RDD depends on a constant number of parent partitions (like <code>map</code>, <code>filter</code>)</li>
<li class=""><strong>Wide Dependencies:</strong> Each partition of child RDD may depend on multiple parent partitions (like <code>groupByKey</code>, <code>join</code>)</li>
</ul>
<p>Wide dependencies create stage boundaries because they require shuffling data across the network.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="memory-management-where-the-magic-happens">Memory Management: Where the Magic Happens<a href="https://www.recodehive.com/blog/spark-architecture#memory-management-where-the-magic-happens" class="hash-link" aria-label="Direct link to Memory Management: Where the Magic Happens" title="Direct link to Memory Management: Where the Magic Happens" translate="no">​</a></h2>
<p>Spark's memory management is what gives it the speed advantage over traditional batch processing systems. Here's how it works:</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="memory-regions">Memory Regions<a href="https://www.recodehive.com/blog/spark-architecture#memory-regions" class="hash-link" aria-label="Direct link to Memory Regions" title="Direct link to Memory Regions" translate="no">​</a></h3>
<p>Spark divides executor memory into several regions:</p>
<p><strong>Storage Memory (60% by default):</strong></p>
<ul>
<li class="">Used for caching RDDs/DataFrames</li>
<li class="">LRU eviction when space is needed</li>
<li class="">Can borrow from execution memory when available</li>
</ul>
<p><strong>Execution Memory (20% by default):</strong></p>
<ul>
<li class="">Used for computation in shuffles, joins, sorts, aggregations</li>
<li class="">Can borrow from storage memory when needed</li>
</ul>
<p><strong>User Memory (20% by default):</strong></p>
<ul>
<li class="">For user data structures and internal metadata</li>
<li class="">Not managed by Spark</li>
</ul>
<p><strong>Reserved Memory (300MB by default):</strong></p>
<ul>
<li class="">System reserved memory for Spark's internal objects</li>
</ul>
<p>The beautiful thing about this system is that storage and execution memory can dynamically borrow from each other based on current needs.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-unified-stack-multiple-apis-one-engine">The Unified Stack: Multiple APIs, One Engine<a href="https://www.recodehive.com/blog/spark-architecture#the-unified-stack-multiple-apis-one-engine" class="hash-link" aria-label="Direct link to The Unified Stack: Multiple APIs, One Engine" title="Direct link to The Unified Stack: Multiple APIs, One Engine" translate="no">​</a></h2>
<p>What makes Spark truly powerful is that it provides multiple high-level APIs that all run on the same core engine:</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="spark-core">Spark Core<a href="https://www.recodehive.com/blog/spark-architecture#spark-core" class="hash-link" aria-label="Direct link to Spark Core" title="Direct link to Spark Core" translate="no">​</a></h3>
<p>The foundation that provides:</p>
<ul>
<li class="">Basic I/O functionality</li>
<li class="">Task scheduling and memory management</li>
<li class="">Fault tolerance</li>
<li class="">RDD abstraction</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="spark-sql">Spark SQL<a href="https://www.recodehive.com/blog/spark-architecture#spark-sql" class="hash-link" aria-label="Direct link to Spark SQL" title="Direct link to Spark SQL" translate="no">​</a></h3>
<ul>
<li class="">SQL queries on structured data</li>
<li class="">DataFrame and Dataset APIs</li>
<li class="">Catalyst query optimizer</li>
<li class="">Integration with various data sources</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="spark-streaming">Spark Streaming<a href="https://www.recodehive.com/blog/spark-architecture#spark-streaming" class="hash-link" aria-label="Direct link to Spark Streaming" title="Direct link to Spark Streaming" translate="no">​</a></h3>
<ul>
<li class="">Real-time stream processing</li>
<li class="">Micro-batch processing model</li>
<li class="">Integration with streaming sources like Kafka</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="mllib">MLlib<a href="https://www.recodehive.com/blog/spark-architecture#mllib" class="hash-link" aria-label="Direct link to MLlib" title="Direct link to MLlib" translate="no">​</a></h3>
<ul>
<li class="">Distributed machine learning algorithms</li>
<li class="">Feature transformation utilities</li>
<li class="">Model evaluation and tuning</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="graphx">GraphX<a href="https://www.recodehive.com/blog/spark-architecture#graphx" class="hash-link" aria-label="Direct link to GraphX" title="Direct link to GraphX" translate="no">​</a></h3>
<ul>
<li class="">Graph processing and analysis</li>
<li class="">Built-in graph algorithms</li>
<li class="">Graph-parallel computation</li>
</ul>
<p>The key insight: all of these APIs compile down to the same core RDD operations, so they all benefit from Spark's optimization engine and can interoperate seamlessly.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="putting-it-all-together">Putting It All Together<a href="https://www.recodehive.com/blog/spark-architecture#putting-it-all-together" class="hash-link" aria-label="Direct link to Putting It All Together" title="Direct link to Putting It All Together" translate="no">​</a></h2>
<p>Now that we've covered all the components, let's see how they work together in a real example:</p>
<div class="language-scala codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-scala codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">// This creates RDDs but doesn't execute anything yet</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">val textFile = spark.textFile("hdfs://large-file.txt")</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">val words = textFile.flatMap(line =&gt; line.split(" "))</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">val wordCounts = words.map(word =&gt; (word, 1))</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">val aggregated = wordCounts.reduceByKey(_ + _)</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">// This action triggers execution of the entire pipeline</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">val results = aggregated.collect()</span><br></div></code></pre></div></div>
<p><strong>What happens behind the scenes:</strong></p>
<ol>
<li class="">Driver creates a DAG with two stages (split by the <code>reduceByKey</code> shuffle)</li>
<li class="">Driver requests executors from cluster manager</li>
<li class="">Stage 1 tasks (read, flatMap, map) execute on partitions across executors</li>
<li class="">Data gets shuffled for the <code>reduceByKey</code> operation</li>
<li class="">Stage 2 tasks perform the aggregation</li>
<li class="">Results get collected back to the driver</li>
</ol>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-this-architecture-matters">Why This Architecture Matters<a href="https://www.recodehive.com/blog/spark-architecture#why-this-architecture-matters" class="hash-link" aria-label="Direct link to Why This Architecture Matters" title="Direct link to Why This Architecture Matters" translate="no">​</a></h2>
<p>Understanding Spark's architecture isn't just academic knowledge - it's the key to working effectively with big data:</p>
<p><strong>Fault Tolerance:</strong> The RDD lineage graph means Spark can recompute lost data automatically without manual intervention.</p>
<p><strong>Scalability:</strong> The driver/executor model scales horizontally - just add more worker nodes to handle bigger datasets.</p>
<p><strong>Efficiency:</strong> Lazy evaluation and DAG optimization mean Spark can optimize entire computation pipelines before executing anything.</p>
<p><strong>Flexibility:</strong> The unified stack means you can mix SQL, streaming, and machine learning in the same application without data movement penalties.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="conclusion-the-beauty-of-distributed-computing">Conclusion: The Beauty of Distributed Computing<a href="https://www.recodehive.com/blog/spark-architecture#conclusion-the-beauty-of-distributed-computing" class="hash-link" aria-label="Direct link to Conclusion: The Beauty of Distributed Computing" title="Direct link to Conclusion: The Beauty of Distributed Computing" translate="no">​</a></h2>
<p>Spark's architecture represents one of the most elegant solutions to distributed computing that I've encountered. By clearly separating concerns - coordination (driver), resource management (cluster manager), and execution (executors) - Spark creates a system that's both powerful and understandable.</p>
<p>The magic isn't in any single component, but in how they all work together. The driver's intelligence in creating optimal execution plans, the cluster manager's efficiency in resource allocation, and the executors' reliability in task execution combine to create something greater than the sum of its parts.</p>
<p>Whether you're processing terabytes of log data, training machine learning models, or running real-time analytics, understanding this architecture will help you reason about performance, debug issues, and design better data processing solutions.</p>
<hr>
<p><em>The next time you see a Spark architecture diagram, I hope you'll see what I see now - not a confusing web of boxes and arrows, but an elegant dance of distributed computing components working in perfect harmony. Happy Sparking! 🚀</em></p>
<div></div>]]></content:encoded>
            <author>rathoreadityasingh30@gmail.com (Aditya Singh Rathore)</author>
            <author>sanjay@recodehive.com (Sanjay Viswanthan)</author>
            <category>Apache Spark</category>
            <category>Spark Architecture</category>
            <category>Big Data</category>
            <category>Distributed Computing</category>
            <category>Data Engineering</category>
        </item>
        <item>
            <title><![CDATA[GitHub Copilot Coding Agent]]></title>
            <link>https://www.recodehive.com/blog/git-coding-agent</link>
            <guid>https://www.recodehive.com/blog/git-coding-agent</guid>
            <pubDate>Fri, 04 Jul 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[An overview of the GitHub Copilot Coding Agent, an AI-powered tool that automates software engineering tasks by taking GitHub Issues as input to write code, run tests, and create pull requests.]]></description>
            <content:encoded><![CDATA[<p> 
In the fast-evolving world of software development, AI-powered tools are changing the game. GitHub is at the forefront with its latest innovation: the <strong>GitHub Copilot Coding Agent</strong>. More than just an in-editor assistant, this powerful new agent works asynchronously to handle entire engineering tasks on its own. Let's dive into what it is, how it works, and how you can leverage it to automate your workflow.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-what-is-github-coding-agent">🚀 <strong>What Is GitHub Coding Agent</strong><a href="https://www.recodehive.com/blog/git-coding-agent#-what-is-github-coding-agent" class="hash-link" aria-label="Direct link to -what-is-github-coding-agent" title="Direct link to -what-is-github-coding-agent" translate="no">​</a></h3>
<p>The GitHub Copilot Coding Agent is an asynchronous software engineering agent that:</p>
<ul>
<li class="">✅Takes GitHub Issues as input.</li>
<li class="">✅Writes code, runs tests, and creates pull requests—just like a teammate.</li>
<li class="">✅Works inside GitHub Actions, unlike the real-time agent mode in your IDE (e.g., VS Code).</li>
</ul>
<hr>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-how-it-works">🔧 How It Works<a href="https://www.recodehive.com/blog/git-coding-agent#-how-it-works" class="hash-link" aria-label="Direct link to 🔧 How It Works" title="Direct link to 🔧 How It Works" translate="no">​</a></h3>
<p><strong>1. Write &amp; Assign an Issue to Copilot</strong><br>
<!-- -->When creating an issue for the GitHub Copilot Coding Agent, clarity and structure are key to getting the best results. Here’s how to craft an effective issue that sets Copilot up for success:</p>
<ul>
<li class="">
<p><strong>Provide Clear Context:</strong><br>
<!-- -->Begin by describing the problem or feature request in detail. Explain <em>why</em> the change is needed, referencing any relevant background, user stories, or business goals. If the issue relates to a bug, include steps to reproduce, expected vs. actual behavior, and any error messages or screenshots.
<img decoding="async" loading="lazy" alt="Creating a new GitHub issue for Copilot" src="https://www.recodehive.com/assets/images/01-code-issue-6434dc7a091818a05bd1e4164486ecc8.png" width="1622" height="895" class="img_ev3q"></p>
</li>
<li class="">
<p><strong>Define Expected Outcomes:</strong><br>
<!-- -->Clearly state what a successful resolution looks like. For features, you can add the image of expected output or drawings etc.</p>
</li>
<li class="">
<p><strong>Include Technical Details:</strong><br>
<!-- -->Add any technical constraints, dependencies, or architectural considerations. Link to relevant code, documentation, or previous issues/PRs. If there are specific files, functions, or APIs involved, mention them explicitly.</p>
</li>
<li class="">
<p><strong>Use Templates and Repo Instructions:</strong><br>
<!-- -->Leverage your repository’s issue templates to maintain consistency. Follow any contribution guidelines or coding standards documented in the repo. This ensures Copilot’s work aligns with your team’s practices.</p>
</li>
<li class="">
<p><strong>Assign the Issue to Copilot:</strong><br>
<!-- -->Just like you would with a human teammate, assign the issue to Copilot. This triggers the agent workflow and signals that the issue is ready for automated handling.
<img decoding="async" loading="lazy" alt="Assigning the GitHub issue to the Copilot agent" src="https://www.recodehive.com/assets/images/02-assign-copilot-be4fa468a0209c0f71c68b7da4c5fce5.png" width="1599" height="896" class="img_ev3q"></p>
</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="example-issue-template"><strong>Example Issue Template:</strong><a href="https://www.recodehive.com/blog/git-coding-agent#example-issue-template" class="hash-link" aria-label="Direct link to example-issue-template" title="Direct link to example-issue-template" translate="no">​</a></h3>
<div class="language-markdown codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-markdown codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><div class="token-line" style="color:#393A34"><span class="token plain">Summary</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">Briefly describe the task or bug.</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">Context</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">Explain why this change is needed. Link to related issues or documentation.</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">Acceptance Criteria</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token list punctuation" style="color:#393A34">-</span><span class="token plain"> [ ] List specific outcomes or deliverables</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token list punctuation" style="color:#393A34">-</span><span class="token plain"> [ ] Include test coverage or documentation updates if needed</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">Technical Notes</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">Mention files, functions, or dependencies involved.</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">Additional Info</span><br></div><div class="token-line" style="color:#393A34"><span class="token plain">Add screenshots, logs, or references as needed.</span><br></div></code></pre></div></div>
<p>By following these steps, you ensure Copilot has all the information it needs to deliver high-quality, context-aware code changes—making your workflow smoother and more efficient.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-what-happens-next">🌟 What Happens Next?<a href="https://www.recodehive.com/blog/git-coding-agent#-what-happens-next" class="hash-link" aria-label="Direct link to 🌟 What Happens Next?" title="Direct link to 🌟 What Happens Next?" translate="no">​</a></h3>
<p>Once you assign the issue to GitHub Copilot, the agent will analyze the requirements and begin working asynchronously. It may take a short while for Copilot to generate the code, run tests, and open a new pull request (PR) with the proposed changes.</p>
<p>You can expect:</p>
<ul>
<li class="">A new PR created automatically by Copilot, referencing the original issue.<br>
<a href="https://github.com/recodehive/recode-website/pull/141" target="_blank" rel="noopener noreferrer" class="">An example Pull Request created by GitHub Copilot</a></li>
<li class="">Automated test results and code suggestions included in the PR.</li>
<li class="">Clear traceability between your issue and the resulting code changes.</li>
</ul>
<p>Stay engaged by reviewing the PR, providing feedback, or merging it when ready. This workflow helps you leverage automation while maintaining control over your codebase.
<img decoding="async" loading="lazy" alt="Promotional banner for GitHub Copilot feedback" src="https://www.recodehive.com/assets/images/03-pr-copilot-101448e84a8b35cd5091b82c2ff5b5e3.png" width="1635" height="911" class="img_ev3q"></p>
<hr>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-earn-200-by-providing-early-stage-feedback">🧭 Earn $200 by providing Early stage Feedback<a href="https://www.recodehive.com/blog/git-coding-agent#-earn-200-by-providing-early-stage-feedback" class="hash-link" aria-label="Direct link to 🧭 Earn $200 by providing Early stage Feedback" title="Direct link to 🧭 Earn $200 by providing Early stage Feedback" translate="no">​</a></h3>
<p>💬 <strong>Share your feedback on Copilot Coding Agent for a chance to win a $200 gift card!</strong></p>
<p>We’re inviting early adopters to help shape the future of the GitHub Copilot Coding Agent. Your insights are invaluable in improving the agent’s usability, reliability, and overall experience. By participating, you’ll have the opportunity to directly influence upcoming features and enhancements.</p>
<p>📍<strong>Note:</strong> The following feedback program was available for early adopters and may no longer be active. Please check the official GitHub blog for current opportunities.</p>
<p><strong>How to participate:</strong></p>
<ol>
<li class=""><strong>Try out the Copilot Coding Agent:</strong><br>
<!-- -->Use the agent to automate coding tasks, resolve issues, or create pull requests in your repository.</li>
<li class=""><strong>Share your experience:</strong><br>
<!-- -->Provide detailed feedback on what worked well, what could be improved, and any challenges you faced. Screenshots, suggestions, and real-world use cases are especially helpful.</li>
</ol>
<p><strong>Why participate?</strong></p>
<ul>
<li class="">The most insightful and actionable feedback will be eligible for a $200 gift card.</li>
<li class="">Help make Copilot Coding Agent more effective for the entire developer community.</li>
<li class="">Get early access to new features and updates.
<img decoding="async" loading="lazy" alt="Promotional banner for GitHub Copilot Coding Agent feedback rewards" src="https://www.recodehive.com/assets/images/03-reward-copilot-72113ef2d66a4f93e06d58360c0c934a.png" width="1627" height="893" class="img_ev3q"></li>
</ul>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-conclusion">✅ Conclusion<a href="https://www.recodehive.com/blog/git-coding-agent#-conclusion" class="hash-link" aria-label="Direct link to ✅ Conclusion" title="Direct link to ✅ Conclusion" translate="no">​</a></h2>
<p>The GitHub Copilot Coding Agent represents a significant step forward in developer productivity and workflow automation. By integrating AI-driven code generation and automated pull requests directly into your GitHub processes, you can streamline repetitive tasks and focus on higher-level problem solving. While automation accelerates development, human insight and collaboration remain essential for delivering quality software. Embrace these tools to enhance your workflow, but always keep user needs and team goals at the center of your development process.</p>
<hr>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-watch-the-demo">🎥 Watch the Demo<a href="https://www.recodehive.com/blog/git-coding-agent#-watch-the-demo" class="hash-link" aria-label="Direct link to 🎥 Watch the Demo" title="Direct link to 🎥 Watch the Demo" translate="no">​</a></h2>
<p>Check out this video walkthrough of the GitHub Copilot Coding Agent in action:</p>
<iframe width="100%" height="400" src="https://www.youtube.com/embed/6AmzJDAOHJ8" title="GitHub Copilot Coding Agent Demo" style="border:none"></iframe>
<hr>
<div></div>]]></content:encoded>
            <author>sanjay@recodehive.com (Sanjay Viswanthan)</author>
            <category>GitHub</category>
            <category>SEO</category>
            <category>Coding agent</category>
            <category>Copilot</category>
            <category>AI</category>
            <category>Automation</category>
        </item>
    </channel>
</rss>