recode hive Blog

Why Data Engineers Make Better Business Analysts Than MBAs Do

rathoreadityasingh30@gmail.com (Aditya Singh Rathore) — Tue, 19 May 2026 00:00:00 GMT

The VP of Marketing walked into the quarterly business review with a slide deck. Forty-three slides. The headline on slide 7 read: "Customer acquisition cost down 18% QoQ."

The data engineer sitting in the back of the room knew something was wrong before the slide finished loading.

Three weeks earlier, she had noticed a JOIN condition in the CAC calculation pipeline that was double-counting leads from the new referral program. She had filed a ticket. The ticket was still open. The number on that slide — the one the VP was presenting to the CEO, the one that would inform next quarter's $2M budget allocation — was wrong by a factor that would embarrass everyone in the room once someone finally ran the corrected query.

She raised her hand.

That moment, when the data engineer knows the data is wrong before the analyst finishes presenting it, is not an accident. It is the structural consequence of a difference in how MBAs and data engineers relate to business data. One group studies it. The other one builds the systems that produce it.

What this post argues:

Why proximity to data systems is a more durable analytical advantage than business frameworks
The five specific skills data engineers have that MBAs typically don't — and why each one matters for business analysis
Where MBAs still have a genuine edge (and data engineers should stop pretending otherwise)
What the best business analysts of the next decade will look like — and why neither camp gets there alone

This is going to step on some toes. That is intentional.

First, a Fair Definition of Terms

Before making the argument, it is worth being precise about what is actually being compared.

MBA here means someone trained in the traditional business school tradition: frameworks for strategy (Porter's Five Forces, BCG Matrix), finance (DCF, unit economics), and organizational behavior. They are taught to analyze businesses from the outside, to take a set of numbers, apply a framework, and produce a recommendation.

Data engineer here means someone who designs, builds, and operates the systems that collect, transform, store, and serve data. They spend their days inside pipelines, schemas, and query plans in direct contact with how data is actually produced, not just how it eventually appears in a report.

Business analyst here means the role that sits between raw data and business decisions: translating what the data says into what the business should do.

The argument is not that MBAs are bad at analysis in general. It is that data engineers have a structural advantage specifically in the business analyst role, because that role increasingly depends on understanding the data infrastructure underneath the numbers — not just the numbers themselves.

Reason #1: Data Engineers Know When the Number Is Wrong

This is the most important one, and it is the one that is hardest to teach.

Every metric in a business, whether revenue, churn, CAC, LTV, or retention, is produced by a pipeline. That pipeline has JOIN conditions, aggregation logic, filter predicates, and data source assumptions baked into it. When the pipeline has a bug, or when upstream data quality degrades, or when the definition of a metric silently changes because someone modified a dbt model, the number in the dashboard changes too.

An MBA looking at that number sees: a trend. A data engineer who built or maintains that pipeline sees: the JOIN that changed last Tuesday, the source table that started receiving nulls on day 14 of last month, the filter that was added to "clean up outliers" that accidentally excluded an entire customer segment.

This is not a hypothetical. It happens constantly, in every organization that runs on data, at every scale. The question is not whether there are errors in your metrics; there are. The question is who in the room knows about them before the decision gets made.

Note: A 2023 survey by Monte Carlo Data found that data engineers spend an average of 40% of their time on data quality issues: finding them, diagnosing them, and fixing them. That is not a cost center. That is 40% of someone's professional life spent developing an intimate understanding of where and how business data breaks.

The MBA in that quarterly business review learned Porter's Five Forces. The data engineer learned that the CAC pipeline double-counts referral leads. Both are forms of knowledge. Only one of them catches the error before the budget gets misallocated.
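
To make the double-counting mechanism concrete, here is a minimal sketch using an in-memory SQLite database. The table names, columns, rows, and spend figure are all hypothetical and are not the pipeline from the story. The point is only that a JOIN against a table with several rows per lead inflates the lead count, which in turn deflates CAC.

```python
import sqlite3

# Hypothetical schema: one row per lead, several referral events per referred lead.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE leads (lead_id INTEGER PRIMARY KEY, channel TEXT);
    CREATE TABLE referral_events (lead_id INTEGER, event TEXT);
    INSERT INTO leads VALUES (1, 'paid'), (2, 'paid'), (3, 'referral'), (4, 'referral');
    INSERT INTO referral_events VALUES
        (3, 'invite_sent'), (3, 'invite_accepted'),
        (4, 'invite_sent'), (4, 'invite_accepted');
""")

MARKETING_SPEND = 1200.0  # assumed figure, for illustration only

# Buggy: the join fans out to one row per referral event, so COUNT(*)
# counts each referral lead twice.
buggy_leads = conn.execute("""
    SELECT COUNT(*)
    FROM leads l
    LEFT JOIN referral_events r ON r.lead_id = l.lead_id
""").fetchone()[0]

# Corrected: count each lead once, however many events it has.
correct_leads = conn.execute("""
    SELECT COUNT(DISTINCT l.lead_id)
    FROM leads l
    LEFT JOIN referral_events r ON r.lead_id = l.lead_id
""").fetchone()[0]

buggy_cac = MARKETING_SPEND / buggy_leads
correct_cac = MARKETING_SPEND / correct_leads
print(f"buggy CAC: {buggy_cac:.2f}, corrected CAC: {correct_cac:.2f}")
```

With these made-up rows, the inflated lead count makes CAC look a third lower than it really is. That is exactly the kind of flattering error nobody rushes to investigate.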

Reason #2: They Understand the Difference Between What Data Says and What Data Means

Here is a question that sounds simple and is actually hard: Is a spike in daily active users good news?

The MBA answers: yes, obviously. Growth is good.

The data engineer asks three questions before answering anything:

Did the event tracking code change recently? (A new screen_view event being fired twice could double DAU artificially.)
Did the definition of "active" change in the metrics layer?
Is this spike uniform across platforms, or is it isolated to one app version that might have a tracking bug?

This is not paranoia. This is pattern recognition earned by having debugged dozens of "spikes" that turned out to be instrumentation errors, schema migrations, or upstream data source changes. Data engineers develop a strong prior that anomalies in data are more likely to be measurement errors than real business events because, in their experience, that is usually true.
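
As a rough illustration of the first and third questions, the sketch below compares an event-count "DAU" with a distinct-user DAU and computes events per user by app version. The event tuples and version numbers are invented, and it assumes the inflated metric counts events rather than distinct users, which is the case in which a double-fired event would distort it.

```python
from collections import defaultdict


def events_per_user_by_version(events):
    """Return {app_version: screen_view events per distinct user}."""
    event_counts = defaultdict(int)
    users = defaultdict(set)
    for user_id, app_version in events:
        event_counts[app_version] += 1
        users[app_version].add(user_id)
    return {v: event_counts[v] / len(users[v]) for v in event_counts}


# Hypothetical one-day sample: version 4.2 fires screen_view twice per user.
events = [
    ("u1", "4.1"), ("u2", "4.1"),
    ("u3", "4.2"), ("u3", "4.2"),
    ("u4", "4.2"), ("u4", "4.2"),
]

event_based_dau = len(events)                           # counts rows, not people
distinct_user_dau = len({user for user, _ in events})   # counts people
ratios = events_per_user_by_version(events)

# A version whose ratio jumps well above the others is worth checking for
# an instrumentation bug before anyone celebrates the spike.
suspect_versions = [v for v, ratio in ratios.items() if ratio >= 2.0]
print(event_based_dau, distinct_user_dau, ratios, suspect_versions)
```

The threshold of 2.0 is arbitrary; in practice the useful signal is a version whose ratio departs sharply from its own history.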

Business analysis requires exactly this skepticism. The job is not to report what the number says. It is to assess whether the number is trustworthy, what it actually measures, and what legitimate conclusions can be drawn from it. Data engineers are trained for this by doing it wrong enough times that it becomes instinct.

A real pattern, seen repeatedly across teams:

Scenario	MBA interpretation	Data engineer's first question
Revenue up 12% MoM	Strong growth signal	Did the billing pipeline change?
Churn down 3%	Retention improving	Was the churn definition updated?
Support tickets up 40%	Product quality issue	Did we change the ticket tagging logic?
Page load time improved	Engineering win	Is the new monitoring missing slow requests?

The MBA is not wrong to interpret those signals. The data engineer is right to interrogate them first.

Reason #3: They Think in Systems, Not Snapshots

MBA training is heavily oriented around snapshots: a financial model at a point in time, a competitive analysis as of this quarter, a market sizing exercise based on current data. The analytical unit is the report.

Data engineering is, fundamentally, about systems that produce data continuously over time. The analytical unit is the pipeline, a thing that runs repeatedly, handles changing inputs, breaks in specific ways under specific conditions, and accumulates state.

This shapes how you think about business problems in ways that matter.

When an MBA sees declining retention, they reach for a segmentation analysis: which cohort is churning, what do those users have in common, what intervention addresses the segment. This is useful analysis.

When a data engineer sees declining retention, they also ask: is the retention calculation correct, is it consistent across cohorts, are we measuring the same thing for users who signed up six months ago as for users who signed up last week, did the product change in a way that makes the old metric definition no longer comparable?

The MBA is doing cross-sectional analysis. The data engineer is doing longitudinal systems thinking — asking whether the measurement is stable across time, not just whether the trend is meaningful within one period.

This difference shows up in business analysis as the gap between finding a pattern and understanding whether the pattern is real.

Info: The classic example is a SaaS company that sees retention improving in its cohort analysis. The data engineer checks whether the cohort definition changed. It did: the team quietly started excluding users who never completed onboarding from the retention denominator. Retention "improved" because the measurement changed, not because users stopped churning. The MBA writes a memo about the success of the new onboarding flow. The data engineer spots the denominator change in a Git commit.
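
The denominator effect in that example is easy to see with numbers. The figures below are invented, and the sketch assumes every retained user had completed onboarding, so the numerator is the same under both definitions.

```python
def retention_rate(retained_users: int, cohort_size: int) -> float:
    """Share of the cohort still active at the end of the period."""
    return retained_users / cohort_size


# Hypothetical cohort, for illustration only.
signups = 1000
never_onboarded = 400
retained = 300

# Old definition: everyone who signed up is in the denominator.
old_definition = retention_rate(retained, signups)

# New definition: users who never completed onboarding are excluded.
new_definition = retention_rate(retained, signups - never_onboarded)

print(f"old: {old_definition:.0%}, new: {new_definition:.0%}")
```

The same 300 retained users produce two different retention figures, and the only thing that changed is who counts as part of the cohort.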

Reason #4: They Know What It Costs to Answer a Question

Here is something MBA programs do not teach: some business questions are expensive to answer, and the cost of answering them should factor into whether you ask them.

Data engineers know this instinctively. They have seen a well-meaning analyst write a query that full-scanned a 10TB table to answer a question that could have been answered with a 50MB aggregate. They have been paged at 2am because a dashboard query took down a production database. They have estimated the engineering cost of building the data infrastructure required to answer a question that turned out not to need answering.
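
For a sense of scale, the sketch below compares the two scans mentioned above. It uses decimal units, and the per-terabyte price is an assumed placeholder, not any vendor's actual rate.

```python
TB = 10**12  # decimal terabyte, in bytes
MB = 10**6   # decimal megabyte, in bytes

full_scan_bytes = 10 * TB
aggregate_bytes = 50 * MB

# How many times more data the full scan reads than the aggregate.
scan_ratio = full_scan_bytes / aggregate_bytes

PRICE_PER_TB_SCANNED = 5.00  # assumed price, for illustration only


def scan_cost(bytes_scanned: int, price_per_tb: float = PRICE_PER_TB_SCANNED) -> float:
    """Cost of a query billed purely by bytes scanned."""
    return bytes_scanned / TB * price_per_tb


print(f"ratio: {scan_ratio:,.0f}x")
print(f"full scan: {scan_cost(full_scan_bytes):.2f}")
print(f"aggregate: {scan_cost(aggregate_bytes):.5f}")
```

Whatever the real price is, the ratio is what matters: the same answer costs five orders of magnitude more to compute one way than the other.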

This changes how you frame business analysis questions. When a data engineer considers a business question, they are simultaneously asking:

1. Is this data available, or does it need to be collected?
2. If collected, how clean is it, and what cleaning effort is required to use it?
3. What is the query cost of answering this at the granularity the question implies?
4. Is the question answerable at all with the data that exists, or is it being asked in a form that sounds precise but cannot be operationalized?

The last one is underrated. A lot of business questions, stated precisely, cannot be answered with existing data. "What is our true customer lifetime value?" sounds like a concrete question. A data engineer knows that answering it requires solving a customer identity resolution problem, a revenue attribution problem, and a survivorship bias problem before the math even starts, and that the data to solve all three may not exist in a form that supports the precision implied by the question.

An MBA will build a DCF model around an LTV number. A data engineer will ask how the LTV was calculated and whether the denominator includes the customers who churned before their first purchase. These are not the same conversation.

Reason #5: They Have Built Things That Failed in Production

There is a specific kind of knowledge that only comes from building systems that fail in production: the knowledge that the gap between how something is supposed to work and how it actually works is almost always larger than you expect.

Data engineers live in that gap. A pipeline that processes customer events correctly in staging fails in production when the API starts sending Unicode characters in a field that was always ASCII in the test environment. A join that works perfectly on last month's data produces duplicates on this month's data because an upstream system changed its primary key generation logic. A metric that was correct for two years becomes incorrect when the product introduces a new pricing tier that the original calculation logic never anticipated.
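
One common defence against the changed-primary-key failure is a uniqueness check that runs before the join. The sketch below is a generic version of that idea; the column name and rows are hypothetical.

```python
from collections import Counter


def find_duplicate_keys(rows, key):
    """Return the sorted key values that appear more than once in rows."""
    counts = Counter(row[key] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)


# Hypothetical upstream extracts: last month's keys were unique, this month's are not.
last_month = [{"customer_id": "C1"}, {"customer_id": "C2"}]
this_month = [{"customer_id": "C1"}, {"customer_id": "C1"}, {"customer_id": "C2"}]

print(find_duplicate_keys(last_month, "customer_id"))
print(find_duplicate_keys(this_month, "customer_id"))
```

A check like this does not explain why the upstream system changed, but it turns a silent fan-out into a loud failure at the point where it is cheapest to fix.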

The cumulative effect of building, breaking, debugging, and fixing data systems is a deeply skeptical relationship with any number that comes out of a system you didn't build yourself. This skepticism is not cynicism. It is calibration.

Business analysis depends on calibrated skepticism about data. The analyst who trusts every number in the dashboard is going to make bad recommendations. The analyst who knows, from experience, that dashboards lie in specific ways and in predictable circumstances is going to ask the right questions before drawing conclusions.

MBAs are trained in analytical frameworks. Data engineers are trained, by production, to distrust the inputs to those frameworks until proven otherwise. In business analysis, that is often the more valuable skill.

Where MBAs Still Have the Edge (And Data Engineers Should Admit It)

This argument would be dishonest if it stopped here. MBAs have genuine advantages in business analysis that data engineers tend to lack, and those advantages are not trivial.

Stakeholder communication. The ability to take a complex finding and present it clearly to a non-technical audience (a CFO, a board, a product team) is a skill that MBA programs drill explicitly and data engineering programs largely ignore. Data engineers frequently know the right answer and communicate it in a way that nobody acts on. That is an analytical failure, even if the analysis is technically correct.
Business context and domain knowledge. MBA training includes deliberate exposure to finance, operations, marketing, strategy, and organizational behavior. Data engineers often develop deep expertise in one or two functional areas, wherever their pipelines touch, and shallow knowledge everywhere else. A data engineer who works on the payments pipeline knows a lot about transaction data and relatively little about go-to-market strategy. Business analysis requires breadth.
Framework fluency. Porter's Five Forces, the BCG matrix, unit economics, and customer segmentation frameworks are not just jargon. They are shared vocabulary that allows business analysts to communicate efficiently with executives and cross-functional stakeholders. Data engineers who lack this vocabulary can be technically right and organizationally invisible.
Comfort with ambiguity in the absence of data. Sometimes there is no data for the decision that needs to be made. Sometimes the best analysis is qualitative: customer interviews, expert judgment, market intuition. MBA training includes frameworks for making decisions under genuine uncertainty. Data engineers can be paralyzed by the absence of clean data, waiting for a complete dataset before forming a view.

The honest summary: data engineers are better at knowing whether the data is trustworthy. MBAs are better at knowing what to do with it once trust is established. The best business analysts do both.

What the Best Business Analysts of the Next Decade Will Look Like

The role of business analyst is changing faster than either MBA programs or data engineering teams are adjusting to.

Five years ago, business analysis meant querying data that someone else built and explaining what it showed. Today, it increasingly means understanding the systems that produce the data, the definitions embedded in the pipelines, the quality characteristics of each source, and the engineering cost of the insights that the business is asking for. The data infrastructure is no longer a background condition of business analysis. It is part of the analysis itself.

This shift favors data engineers. That is not because MBAs cannot learn the technical side (they can, and many of the best analysts today have done exactly that) but because the default starting point of a data engineer (systems thinking, data skepticism, production-failure experience) is closer to where the role is going than the default starting point of an MBA (framework fluency, stakeholder communication, comfort with abstraction).

The analyst who will be most valuable in 2030 is probably not a data engineer who learned to give better presentations, though that person is useful. It is probably someone who started in data engineering, developed genuine business and communication fluency, and can operate at both layers simultaneously — who can tell you that the CAC number is wrong because of a JOIN condition and also tell you what to do about the marketing budget given the corrected number.

That person is rare. Both communities should be trying to produce more of them.

The Practical Implication for Hiring Managers

If you are hiring business analysts and you are only looking at MBA credentials, you are filtering out a large population of candidates who have the specific skills that increasingly define excellent business analysis.

Some questions worth asking of any analyst candidate, whether MBA or data engineer:

1. "Walk me through a time when you questioned a metric that the business was relying on. What did you find?"
2. "How do you validate that a number in a dashboard is trustworthy before you use it in a recommendation?"
3. "Describe a business question you were asked that turned out to be unanswerable with existing data. What did you do?"
4. "How would you explain your most complex analysis to someone who has never seen the underlying data?"

The first three questions tend to favor data engineers. The fourth tends to favor MBAs. The best candidates handle all four.

The Practical Implication for Data Engineers

If you are a data engineer who wants to move into business analysis, the technical credibility is already there. The gap is almost always on the communication and business context side.

Specifically:

Learn to write for non-technical audiences. Your analysis is only as valuable as the decision it informs. If the decision-maker cannot understand your analysis, it will not inform the decision regardless of how technically correct it is.

Develop opinions about the business, not just the data. Business analysts are paid to have views, not just to report numbers. Build the habit of ending every analysis with a recommendation, not just a finding.

Learn enough finance to speak the language of decisions. Unit economics, margin, payback period, and ROI are not complicated concepts, but they are the vocabulary in which business decisions get made. Fluency costs a few weeks of deliberate study and pays dividends for a career.

Stop waiting for perfect data. The business will not wait. Learn to state your assumptions explicitly, quantify your uncertainty, and make a recommendation anyway. That is what business analysis actually looks like in practice.

Key Takeaways

Data engineers are closer to the truth of the data. They know where it breaks, when to distrust it, and what the numbers actually measure. This is not a minor advantage in business analysis. It is the foundation the entire role sits on.

MBA training optimizes for communication and framework fluency. These are real skills that data engineers typically underinvest in. The ability to turn a correct analysis into a decision requires them.

The distinction between "knowing the data is wrong" and "knowing what to do when it's right" maps almost exactly onto the skill gaps between these two communities. The most valuable analysts close both gaps.

Proximity to data systems is becoming a core business analysis competency. As data infrastructure complexity grows, the analyst who does not understand the systems producing the data is increasingly at a disadvantage — not just technically, but analytically.

The future belongs to neither camp exclusively. It belongs to people who take the best of both: the systems thinking and data skepticism of the data engineer, and the communication fluency and business judgment of the MBA. Both communities should be building toward that combination, not defending their territory.

Frequently Asked Questions

Q: Isn't this just survivorship bias? You're describing the best data engineers, not the average one.

Fair point. The average data engineer is not a strong business analyst; they may have all the technical instincts described above but communicate them poorly, lack business context, or have no interest in the strategy side. The argument is not that all data engineers are better business analysts. It is that the skills data engineering develops are more directly applicable to modern business analysis than MBA training is, when those skills are coupled with business communication ability.

Q: Do MBAs actually lack technical data skills? Many MBA programs now teach analytics.

MBA programs have added data analytics courses, and some graduates are technically capable. But there is a meaningful difference between coursework in SQL or Tableau and two years of debugging production pipelines. The experiential depth of a working data engineer, the calibration that comes from building systems that fail and fixing them, is not replicable by a semester of coursework.

Q: Isn't the real answer just to hire both and have them work together?

Yes, and many strong organizations do exactly this. But the question is about individual capability in the business analyst role, not team composition. The argument is that a single data engineer with communication skills is often more effective in that role than a single MBA without data infrastructure fluency because the errors that compound silently in business analysis tend to live in the data layer, not the framework layer.

Q: What about domain-specific industries like finance or healthcare where the MBA's domain knowledge is critical?

Domain knowledge matters enormously, and in highly specialized industries, the MBA's domain fluency may outweigh the data engineer's infrastructure intuition. The argument applies most cleanly to tech and data-intensive consumer businesses, where the data infrastructure is the business in a meaningful sense and errors in data systems directly translate into errors in business decisions.

About the Author

Aditya Singh Rathore is a Data Engineer focused on building modern, scalable data platforms on Azure. He writes about data engineering, cloud architecture, and real-world pipelines on RecodeHive, turning hard-won production lessons into content anyone can apply.

📩 Data engineer who made the move into business analysis? MBA who learned the data engineering side? Drop your story in the comments; the best takes on this come from people who've lived both sides.

How We Used Purview Data Catalog to Reduce Onboarding Time for New Data Engineers from 2 Weeks to 3 Days

rathoreadityasingh30@gmail.com (Aditya Singh Rathore) — Tue, 19 May 2026 00:00:00 GMT

The ticket came in on a Wednesday. A new data engineer, two weeks into the job, had spent four days trying to understand why the customer_lifetime_value column in the Gold layer showed different numbers than the same field in the BI report.

It was not a pipeline bug. The column existed in two places: once in gold.customer_metrics (calculated monthly) and once in gold.customer_ltv_rolling (calculated on a 90-day rolling window). Both were correct. Neither was documented. Nobody had told the new engineer either table existed, let alone the difference between them.

He had been Slacking three different senior engineers to chase down an answer that should have taken five minutes to find independently.

That ticket was the moment we decided to fix onboarding.

What this post covers:

The exact problem structure that made onboarding slow and why it was invisible to us until we measured it
How we used Microsoft Purview's three core capabilities (searchable catalog, lineage visualization, and business glossary with ownership metadata) to eliminate the "who do I ask?" loop
The configuration steps that actually moved the needle, with real before-and-after numbers for each
What we got wrong the first time, and the one thing that made the second attempt stick

The Problem, Measured

Before we changed anything, we ran a structured retrospective with four recent hires across different seniority levels. We asked one question: "In your first two weeks, where did you spend time that you wish you hadn't?"

The answers sorted into four buckets with near-perfect consistency:

Time sink	Avg. hours lost	Root cause
Finding which table to use for a given metric	18 hrs	No searchable catalog; tables discovered by asking people
Understanding upstream dependencies before touching a pipeline	14 hrs	No lineage visibility; had to trace JOINs manually through code
Figuring out who owns a dataset / who to ask about it	11 hrs	Ownership lived in people's heads or stale Confluence pages
Reading existing pipeline code to understand business logic	9 hrs	Expected; we accepted this as non-reducible

Total addressable time: 43 hours across the first two weeks. The fourth bucket, reading code, we treated as irreducible. A new engineer needs to read the code. The first three buckets were pure friction. They produced no learning, only delay.

The target was to get those 43 hours to under 5. That is the difference between a two-week ramp and a three-day one.
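
The arithmetic behind those figures, using the rows of the table above, can be written out as a short sketch. The dictionary keys are shortened labels for the table rows, not names from any real system.

```python
# Average hours lost per new hire, taken from the table above.
hours_lost = {
    "finding the right table": 18,
    "tracing upstream dependencies": 14,
    "finding the dataset owner": 11,
    "reading pipeline code": 9,
}

# Reading code was treated as irreducible, so it is excluded from the target.
irreducible = {"reading pipeline code"}

total_hours = sum(hours_lost.values())
addressable_hours = sum(h for name, h in hours_lost.items() if name not in irreducible)

target_hours = 5
required_reduction = (addressable_hours - target_hours) / addressable_hours

print(total_hours, addressable_hours, f"{required_reduction:.0%}")
```

Getting from 43 addressable hours to under 5 means removing roughly 88% of the friction, which is why a partial fix to only one of the three buckets would not have been enough.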

Note: The 9 hours spent reading pipeline code did not disappear after the Purview rollout. It actually went down slightly, because engineers who understand the lineage before reading the code read it more efficiently. But we did not count on that in our projections.

Our Data Estate Before Purview

To understand what we configured, you need to know what we were working with. The estate was not enormous, but it was complex enough to be disorienting for someone new:

Data Sources
├── Azure SQL Database (transactional - orders, customers, products)
├── Kafka → Event Hubs (clickstream, app events)
├── Third-party REST APIs (marketing attribution, support tickets)
│
ADF Pipelines (ingestion, ~40 pipelines)
│
ADLS Gen2
├── bronze/    (raw, partitioned by source and date)
├── silver/    (cleaned, Delta tables, ~180 tables)
└── gold/      (aggregated, serving layer, ~60 tables)
│
Azure Synapse Analytics (SQL serving for BI)
│
Power BI (dashboards, ~25 reports)

240 tables across three layers. 40 ADF pipelines. 25 Power BI reports. No central documentation. New engineers navigated this through a combination of institutional knowledge, Slack archaeology, and luck.

The Three Purview Capabilities That Moved the Needle

We did not use every Purview feature. We used three, in a deliberate order, because each one built on the last.

Capability 1: Searchable Data Catalog (Week 1 unlock)

The first and most urgent problem: new engineers could not find tables without asking someone. The bronze, silver, and gold layers had consistent naming conventions internally, but there was no way to search across all 240 tables by business concept. If you wanted the table behind the "monthly active users" metric, you had to know to look in gold.user_engagement_monthly, a name that is only obvious in retrospect.

Purview's catalog solves this through asset scanning and enrichment. Here is the scanning configuration we used for ADLS Gen2:

// Purview Scan Configuration — ADLS Gen2 Silver Layer
{
  "name": "silver-layer-full-scan",
  "kind": "AdlsGen2Msi",
  "properties": {
    "scanRulesetName": "AdlsGen2",
    "scanRulesetType": "System",
    "collection": {
      "referenceName": "data-platform-silver",
      "type": "CollectionReference"
    },
    "dataSourceName": "adls-prod-silver",
    "scanLevel": "Full",
    "fileFormats": ["Delta", "Parquet", "CSV"],
    "filter": {
      "excludeUriPrefixes": ["silver/archive/", "silver/tmp/"]
    }
  },
  "schedule": {
    "recurrence": {
      "frequency": "Week",
      "interval": 1,
      "startTime": "2026-01-01T02:00:00Z"
    }
  }
}

Scanning alone gives you asset discovery: Purview registers every table it finds. The second step is enrichment: adding descriptions, classifications, and business tags that make assets searchable by concept rather than just by name.

We built a lightweight enrichment script that ran after each scan and pushed descriptions from our dbt schema.yml files directly into Purview via the Atlas API:

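# enrich_purview_assets.py
import requests
import yaml
from azure.identity import DefaultAzureCredential

PURVIEW_ENDPOINT = "https://.purview.azure.com"


def get_token() -> str:
    # Assumption: the original script does not show get_token(). This version
    # uses azure-identity to request a token for the Purview resource.
    credential = DefaultAzureCredential()
    return credential.get_token("https://purview.azure.net/.default").token


HEADERS = {"Authorization": f"Bearer {get_token()}"}


def push_descriptions_from_dbt(schema_yml_path: str, layer: str):
    """Push dbt model descriptions and business tags to matching Purview assets."""
    with open(schema_yml_path) as f:
        schema = yaml.safe_load(f)

    for model in schema.get("models", []):
        # Prefix with the layer the schema file describes (e.g. "gold"),
        # rather than hardcoding "silver" for every file.
        asset_name = f"{layer}.{model['name']}"
        description = model.get("description", "")
        tags = model.get("meta", {}).get("business_tags", [])

        # Find asset GUID in Purview
        search_resp = requests.post(
            f"{PURVIEW_ENDPOINT}/catalog/api/search/query",
            headers=HEADERS,
            json={"keywords": asset_name, "limit": 1}
        )
        search_resp.raise_for_status()
        results = search_resp.json().get("value", [])
        if not results:
            # The asset has not been scanned yet, or the name does not match
            print(f"No Purview asset found for {asset_name}; skipping")
            continue
        guid = results[0]["id"]

        # Push description and tags
        update_resp = requests.put(
            f"{PURVIEW_ENDPOINT}/catalog/api/atlas/v2/entity/guid/{guid}",
            headers=HEADERS,
            json={
                "entity": {
                    "guid": guid,
                    "attributes": {
                        "userDescription": description,
                        "businessAttributes": {"tags": tags}
                    }
                }
            }
        )
        update_resp.raise_for_status()


push_descriptions_from_dbt("models/gold/schema.yml", layer="gold")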

After enrichment, a new engineer searching "monthly active users" in the Purview catalog surface gets gold.user_engagement_monthly as the top result with a description, the columns it contains, who owns it, and when it was last updated.

Before: 18 hours finding the right table. After: under 20 minutes, self-serve.
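
The part of the enrichment step that is easiest to get wrong is the mapping from dbt metadata to catalog fields, so it can help to isolate it from the network calls. The sketch below is one hedged way to do that: `enrichment_records` is a hypothetical helper, not part of the script above or of any Purview SDK, and the sample descriptions and tags are invented.

```python
def enrichment_records(schema: dict, layer: str) -> list[dict]:
    """Map a parsed dbt schema.yml dict to the fields pushed to the catalog."""
    records = []
    for model in schema.get("models", []):
        records.append({
            "asset_name": f"{layer}.{model['name']}",
            "description": model.get("description", ""),
            "tags": model.get("meta", {}).get("business_tags", []),
        })
    return records


# Shape of a parsed schema.yml (what yaml.safe_load would return); values are invented.
parsed_schema = {
    "models": [
        {
            "name": "user_engagement_monthly",
            "description": "Monthly active users by platform.",
            "meta": {"business_tags": ["monthly active users", "engagement"]},
        },
        # A model with no description or tags still yields a usable record.
        {"name": "customer_metrics"},
    ]
}

records = enrichment_records(parsed_schema, layer="gold")
for record in records:
    print(record)
```

Keeping this mapping as a pure function means it can be unit-tested against sample schema files without a Purview account or credentials.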

Capability 2: Lineage Visualization (Day 2–3 unlock)

Finding the right table was the first unlock. Understanding whether it was safe to modify a pipeline that fed into that table was the second.

Before Purview, tracing pipeline dependencies required one of two things: reading every ADF pipeline config file in sequence, or asking a senior engineer to walk you through it. Both took hours. The first was error-prone because pipeline dependency is not always explicit in ADF: a dataset referenced in one pipeline may be consumed by three others with no obvious link in the code.

Purview's lineage graph is populated automatically by the ADF integration. Once connected, every ADF pipeline run registers its source and sink assets in Purview, building a dependency graph that is always current, not a diagram someone drew once and forgot to update.

Setting up the ADF-to-Purview lineage connection is a one-time configuration:

# Step 1 - Enable managed identity on your ADF instance
az datafactory update \
  --resource-group rg-data-platform \
  --factory-name adf-prod \
  --identity '{"type": "SystemAssigned"}'

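# Step 2 - Add ADF's managed identity as a root collection admin in Purview
# (this command grants collection admin rights; the Data Curator role itself is
# assigned under the collection's role assignments in the Purview portal)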
az purview account add-root-collection-admin \
  --account-name purview-prod \
  --resource-group rg-data-platform \
  --object-id $(az datafactory show \
      --name adf-prod \
      --resource-group rg-data-platform \
      --query identity.principalId -o tsv)

# Step 3 - Connect ADF to Purview (done in ADF Studio UI or via ARM)
# In ADF Studio: Manage → Microsoft Purview → Connect to Purview account
# Select your Purview account. ADF will start reporting lineage on next pipeline run.

After this, every ADF pipeline run automatically updates the lineage graph. A new engineer who wants to know what feeds gold.customer_metrics opens the asset in Purview, clicks the Lineage tab, and sees the full upstream chain, from the source Azure SQL tables, through the ADF copy activity, through the Spark transformation job, to the gold table, without asking anyone.

The downstream view is just as valuable. Before touching a silver table, a new engineer can see exactly which gold tables and Power BI reports depend on it. That single capability eliminated the most common new-hire mistake: modifying a table without realizing it breaks a downstream report.

Before: 14 hours tracing dependencies manually. After: under 10 minutes in the lineage tab.
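
Conceptually, the downstream view is a graph traversal: start at the table you want to change and collect everything reachable from it. The sketch below shows that idea on a hand-written dictionary. It does not call Purview, and the edges, `silver.orders`, and `powerbi.finance_overview` are hypothetical names used only for illustration.

```python
from collections import deque


def downstream_assets(lineage: dict, start: str) -> list[str]:
    """Return every asset reachable downstream of start, sorted by name."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)


# Hypothetical lineage edges: each asset maps to the assets that read from it.
lineage = {
    "silver.orders": ["gold.customer_metrics", "gold.customer_ltv_rolling"],
    "gold.customer_metrics": ["powerbi.finance_overview"],
    "gold.customer_ltv_rolling": [],
}

impacted = downstream_assets(lineage, "silver.orders")
print(impacted)
```

The value of the lineage tab is that nobody has to maintain a dictionary like this by hand; the graph is rebuilt from pipeline runs.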

Capability 3: Business Glossary + Ownership Metadata (The trust layer)

The catalog tells you what tables exist. The lineage tells you how they connect. Neither tells you whether the table is the authoritative source for a given metric, who is responsible for it when something breaks, or what the business definition of the columns actually means.

Without that layer, a new engineer who found gold.customer_metrics via the catalog still had to ask: "Is this the one the finance team uses? Is customer_lifetime_value here calculated the same way as in the BI report?"

That is where the business glossary and ownership metadata close the gap.

Business Glossary — defining terms once, everywhere

We created a glossary term for every metric that had more than one implementation or a non-obvious definition. Each term includes the canonical definition, the authoritative table that implements it, and links to any non-authoritative implementations with explanations of how they differ.

// Purview Business Glossary Term — via REST API
{
  "name": "Customer Lifetime Value",
  "shortDescription": "Predicted net revenue from a customer over their entire relationship with the company.",
  "longDescription": "Calculated as average order value × purchase frequency × average customer lifespan. The canonical implementation uses a 12-month trailing window. See gold.customer_ltv_annual for the authoritative source. Note: gold.customer_ltv_rolling uses a 90-day window for short-term forecasting — do not use for finance reporting.",
  "status": "Approved",
  "anchor": {
    "glossaryGuid": ""
  },
  "contacts": {
    "Expert": [
      {"id": "", "info": "Data Platform Lead"}
    ],
    "Steward": [
      {"id": "", "info": "Finance Analytics"}
    ]
  },
  "resources": [
    {
      "displayName": "gold.customer_ltv_annual",
      "url": "https://purview.azure.com/catalog/asset/"
    }
  ]
}

Once a glossary term is created, it gets linked to the relevant table assets in the catalog. When a new engineer opens gold.customer_metrics in Purview, they see the glossary terms linked to each column: clickable definitions that explain what the column means in business terms, not just what its data type is.

Ownership metadata — answering "who do I ask?" before it gets asked

Every asset in Purview can have owners and expert contacts assigned. We built a convention: every gold table has exactly one owner (the team responsible for its accuracy) and one expert (the engineer who built or most recently maintained it). Silver tables have an expert; ownership is at the domain level.

We enforced this through a weekly scan that flagged unowned assets:

# check_unowned_assets.py: runs in CI on a weekly schedule
# get_token() and post_slack_alert() are helpers assumed to be defined elsewhere (not shown).
import requests

PURVIEW_ENDPOINT = "https://.purview.azure.com"

def get_unowned_gold_assets():
    """Return qualified names of gold assets with no contacts; alert Slack if any exist."""
    resp = requests.post(
        f"{PURVIEW_ENDPOINT}/catalog/api/search/query",
        headers={"Authorization": f"Bearer {get_token()}"},
        json={
            "keywords": "*",
            "limit": 1000,
            "filter": {
                "and": [
                    {"collectionId": "data-platform-gold"},
                    {"not": {"attributeName": "contacts", "operator": "contains"}}
                ]
            }
        },
        timeout=30,
    )
    resp.raise_for_status()  # fail the CI job loudly on auth or API errors
    unowned = [a["qualifiedName"] for a in resp.json()["value"]]
    if unowned:
        # Post to Slack #data-platform-alerts
        post_slack_alert(
            f":warning: {len(unowned)} gold assets have no owner in Purview:\n"
            + "\n".join(f"• `{name}`" for name in unowned)
        )
    return unowned

if __name__ == "__main__":
    get_unowned_gold_assets()

Within three weeks of running this check, ownership coverage on gold tables went from 34% to 97%.

Before: 11 hours chasing ownership through Slack. After: under 5 minutes, open the asset, click the owner contact, done.

What the Onboarding Experience Looks Like Now

The best way to show what changed is to walk through Day 1 of onboarding as it exists today, compared to before.

Before Purview — Day 1 task: "Understand how we calculate Monthly Active Users"

9:00am  — Assigned the task. Open the Gold layer ADLS container.
9:15am  — 60 tables in gold/. No README. Start reading table names.
9:45am  — Find two plausible tables: user_engagement_monthly, user_activity_agg.
10:00am — Slack senior engineer: "Which one is canonical?"
          No reply until 2pm (engineer in meetings).
2:05pm  — Directed to user_engagement_monthly. Ask what feeds it.
2:10pm  — Told to look at the ADF pipeline. Which one? "Search for 'user'."
2:30pm  — Found three pipelines with 'user' in the name.
3:00pm  — Slack a different engineer to confirm the right pipeline.
4:00pm  — Reply received. Correct pipeline confirmed.
4:15pm  — Start reading pipeline code.
End of day: 7+ hours. Task: still not understood end-to-end.

After Purview — Day 1, same task

9:00am  — Assigned the task.
9:03am  — Search "monthly active users" in Purview catalog.
9:04am  — gold.user_engagement_monthly appears as top result,
           tagged BI-certified, owner: Data Platform team.
9:05am  — Click Lineage tab. Full upstream graph visible:
           app_events (Event Hub) → ADF ingest_clickstream
           → silver.app_events_cleaned
           → ADF transform_user_engagement
           → gold.user_engagement_monthly
           → Power BI: Product Dashboard
9:10am  — Click business glossary link on 'active_days' column.
           Definition: "Number of days in the month the user
           triggered at least one non-background app_event."
9:15am  — Open ADF transform_user_engagement directly from lineage.
9:15am — Start reading pipeline code with full context already loaded.
End of morning: task understood end-to-end. No Slack messages sent.

The 7+ hours collapsed to 15 minutes of navigation. The rest of the day is actual work.

The Numbers, Before and After

After running the updated onboarding process with six new engineers over the following quarter, here is what we measured:

Metric	Before Purview	After Purview	Change
Median time to first independent PR	14 days	4 days	−71%
Hours lost to "who owns this?" questions	11 hrs	0.5 hrs	−95%
Hours lost to table discovery	18 hrs	0.8 hrs	−96%
Hours lost to lineage tracing	14 hrs	1.2 hrs	−91%
Senior engineer interruptions per new hire (first 2 weeks)	23	4	−83%
New hire satisfaction score (onboarding survey, /10)	5.8	8.6	+48%

The senior engineer interruption number deserves attention. Those 23 interruptions per new hire are not free. Each one costs the senior engineer 10–20 minutes of context-switching, and they compound: a team onboarding three engineers simultaneously was absorbing 60+ interruptions per two-week cycle. Purview did not eliminate senior engineer involvement in onboarding. It focused it on the things that actually require human judgment: code review, architectural decisions, domain nuance. The navigational questions (where is this, who owns that, what does this column mean) disappeared almost entirely.

tip

Track senior engineer interruption count as an onboarding KPI, not just new hire ramp time. It is a more honest measurement because it captures the full organizational cost of a slow onboarding, not just the cost to the new hire.

What We Got Wrong the First Time

The first Purview rollout, six months before the one described above, failed quietly. We scanned the assets, registered the tables, and told people it was there. Adoption was near zero. New engineers still asked on Slack.

Three things went wrong:

We scanned but did not enrich. A catalog of 240 tables with no descriptions, no tags, and no business terminology is no more useful than the storage container it reflects. The scan gives you the skeleton. The enrichment (descriptions from dbt schema files, business tags, glossary links) gives it meaning. We skipped the enrichment step and produced a very expensive directory listing.

We did not integrate it into the onboarding checklist. Purview existed as a tool, but new engineers were not told to use it on Day 1. The instinct to Slack a senior engineer is faster than the instinct to open a new tool, until the new tool is explicitly in the workflow. The fix was simple: the onboarding checklist now has "Complete the Purview orientation module" as item 3, before any pipeline work begins.

Ownership coverage was too low to be trusted. When 66% of assets have no listed owner, the catalog trains engineers to distrust it. They look up a table, see no owner, assume the catalog is incomplete, and go back to Slack. Ownership coverage is a prerequisite for trust. We got coverage to 97% before re-launching and we now enforce it with the weekly automated check described above.

warning

Do not announce your data catalog until ownership coverage is above 80% and descriptions are populated on all production-facing assets. A sparse catalog is worse than no catalog: it trains users to distrust the tool before it has a chance to prove its value.

The Configuration Checklist

If you are setting this up from scratch, here is the exact sequence that worked for us:

Phase 1 — Foundation (Week 1–2)
  ✓ Create Purview account, configure collections by data domain
  ✓ Set up managed identity, grant least-privilege access to data sources
  ✓ Configure ADLS Gen2 scan (bronze, silver, gold separately)
  ✓ Connect Azure SQL source scan
  ✓ Run first full scan — verify asset count matches expectation
  ✓ Connect ADF to Purview for automatic lineage reporting

Phase 2 — Enrichment (Week 3–4)
  ✓ Build enrichment script to push dbt descriptions → Purview via Atlas API
  ✓ Create business glossary terms for all metrics used in BI/finance reporting
  ✓ Link glossary terms to table assets and column-level metadata
  ✓ Run first ownership audit — assign owners and experts to all gold assets

Phase 3 — Operationalization (Week 5–6)
  ✓ Add Purview orientation to new hire onboarding checklist (Day 1, item 3)
  ✓ Set up weekly scan schedule
  ✓ Deploy automated unowned-asset check with Slack alerting
  ✓ Create onboarding walkthrough doc: "Your First 3 Tasks in Purview"
  ✓ Add Purview asset links to dbt model docs for cross-reference

Phase 4 — Measurement (Ongoing)
  ✓ Track: senior engineer interruptions per new hire (first 2 weeks)
  ✓ Track: time to first independent PR
  ✓ Track: Purview search volume (proxy for adoption)
  ✓ Track: ownership coverage % (target: >95% on gold, >80% on silver)
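
The Phase 4 coverage targets are easy to turn into a pass/fail check. The sketch below is one way to do it, assuming you can export assets as dicts with a layer and an owner field; those field names, the sample assets, and the function names are hypothetical, not part of any Purview response format.

```python
# Targets taken from the Phase 4 checklist: >95% on gold, >80% on silver.
TARGETS = {"gold": 95.0, "silver": 80.0}

def ownership_coverage(assets: list[dict]) -> dict[str, float]:
    """Percentage of assets per layer that have a non-empty owner field."""
    totals: dict[str, int] = {}
    owned: dict[str, int] = {}
    for asset in assets:
        layer = asset["layer"]
        totals[layer] = totals.get(layer, 0) + 1
        if asset.get("owner"):
            owned[layer] = owned.get(layer, 0) + 1
    return {layer: round(100 * owned.get(layer, 0) / n, 1) for layer, n in totals.items()}

def layers_below_target(coverage: dict[str, float]) -> list[str]:
    """Layers whose coverage does not exceed the checklist target."""
    return sorted(layer for layer, pct in coverage.items() if pct <= TARGETS.get(layer, 0.0))

# Hypothetical export of catalog assets.
assets = [
    {"name": "gold.customer_metrics", "layer": "gold", "owner": "finance-analytics"},
    {"name": "gold.user_engagement_monthly", "layer": "gold", "owner": "data-platform"},
    {"name": "silver.app_events_cleaned", "layer": "silver", "owner": None},
    {"name": "silver.orders_cleaned", "layer": "silver", "owner": "data-platform"},
]

coverage = ownership_coverage(assets)
failing = layers_below_target(coverage)
print(coverage, failing)
```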

Before You Start: What Purview Cannot Do

Purview is not a documentation system. It is a metadata system. The distinction matters.

If your tables have no documentation anywhere (no dbt descriptions, no wiki pages, no comments in DDL), Purview will faithfully catalog their absence. It will find your tables, register their schemas, and draw their lineage. But it cannot infer what col_flg_v2_final means. That knowledge has to come from somewhere, and if it only exists in someone's head, Purview cannot surface it.

The enrichment step, pushing dbt descriptions into Purview via the Atlas API, works because we already had descriptions in our dbt schema files. If you don't have that, the sequencing changes: write the documentation first, then enrich the catalog. Purview is an amplifier, not a generator.
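
Before wiring up any API calls, it helps to see what the enrichment input looks like. The sketch below flattens a parsed dbt schema file into name-to-description pairs and lists what is still undocumented. The dict mirrors the models/columns layout of a dbt schema file; the function name, the sample model, and the gold prefix are illustrative assumptions, and the actual push to Purview is deliberately left out.

```python
def collect_descriptions(schema: dict, prefix: str) -> tuple[dict[str, str], list[str]]:
    """Split models and columns into documented (name -> description) and undocumented names."""
    documented: dict[str, str] = {}
    undocumented: list[str] = []
    for model in schema.get("models", []):
        table = f"{prefix}.{model['name']}"
        entries = [(table, model.get("description") or "")]
        entries += [
            (f"{table}.{column['name']}", column.get("description") or "")
            for column in model.get("columns", [])
        ]
        for name, description in entries:
            if description.strip():
                documented[name] = description.strip()
            else:
                undocumented.append(name)
    return documented, undocumented

# Hypothetical parsed schema file (the structure a YAML loader would return for schema.yml).
schema = {
    "models": [
        {
            "name": "customer_metrics",
            "description": "One row per customer with lifetime metrics.",
            "columns": [
                {"name": "customer_id", "description": "Primary key."},
                {"name": "col_flg_v2_final"},  # nobody wrote this one down
            ],
        }
    ]
}

documented, undocumented = collect_descriptions(schema, prefix="gold")
print(documented)
print(undocumented)
```

The undocumented list is the part worth reading first. It tells you how much writing has to happen before enrichment has anything to amplify.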

It also does not replace code review. New engineers still need to read pipeline code. Purview makes them read it with context: they know what the table is for, who owns it, and what feeds it, which makes them read it faster and understand it more deeply. But it does not replace the code review cycle, the mentorship conversations, or the domain knowledge that only transfers through working together on real problems.

What it replaces is navigational friction. The questions that had no business being asked of a human in the first place.

Key Takeaways

The two-week onboarding problem is a discoverability problem, not a complexity problem. New engineers are not slow because the data estate is hard to understand. They are slow because they cannot find what they are looking for without interrupting someone who built it.

Catalog + lineage + ownership is the minimum viable combination. Each one alone is insufficient. Catalog without lineage tells you what exists but not how it connects. Lineage without catalog gives you a graph you cannot search. Both without ownership give you the map but not who to call when the territory changes.

Ownership coverage is a prerequisite, not a nice-to-have. A catalog with 34% ownership coverage teaches users to distrust the tool. Get coverage above 80% before announcing the rollout, and enforce it with automated checks after.

Enrich before you announce. A scanned-but-unenriched catalog is an expensive directory listing. The descriptions, glossary links, and business tags are what make it a tool rather than a report.

Measure the right thing. Time-to-first-PR is a lagging indicator. Senior engineer interruptions per new hire is the leading indicator: it tells you whether the catalog is actually being used before you see it in ramp time data.

Frequently Asked Questions

Q: We use Databricks Unity Catalog, not Microsoft Purview. Does this apply?

Most of the strategy applies (searchable catalog, lineage, ownership metadata, glossary), but the implementation details differ significantly. Unity Catalog has native lineage for Databricks workloads, which is actually more automatic than Purview's ADF integration for Spark-heavy estates. The enrichment and ownership enforcement logic would be similar in principle, different in API. A follow-up post on Unity Catalog onboarding is in the works.

Q: How long does the initial scan take on a large estate?

Our 240-table scan completed in under 40 minutes. For larger estates (1,000+ assets), expect the first full scan to take 2–4 hours. Incremental scans after the first run are significantly faster, typically under 15 minutes for daily delta.

Q: Do you use Purview's sensitivity labels and data classification?

We do, but that is a governance and compliance story more than an onboarding story. Classification runs automatically as part of each scan and tags columns containing PII, financial data, or health information. New engineers see the classification labels on columns before writing any query, which is useful for access management training but was not a significant contributor to the onboarding time reduction.

Q: What does Purview cost, and was the ROI clear?

Purview pricing is based on data map capacity units and scan compute. For our estate (~250 assets, weekly scans), cost runs roughly $180–220/month. The ROI calculation is straightforward: a single senior engineer's hourly cost, multiplied by 23 interruptions per new hire at 15 minutes each, is about $345 per onboarding cycle at standard engineering rates. Purview paid for itself before the second new hire finished their first week.
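
The figures in that answer can be unpacked with a few lines of arithmetic. The sketch below uses only numbers quoted in this post; the hourly rate is not stated anywhere, so it is back-calculated from the $345 figure rather than asserted.

```python
INTERRUPTIONS_PER_HIRE = 23        # before Purview, from the results table
MINUTES_PER_INTERRUPTION = 15      # midpoint of the 10-20 minute range
QUOTED_COST_USD = 345              # figure quoted in the answer above
PURVIEW_MONTHLY_USD = (180, 220)   # quoted monthly cost range

# Senior engineer time absorbed by one new hire's questions.
hours_per_hire = INTERRUPTIONS_PER_HIRE * MINUTES_PER_INTERRUPTION / 60

# Hourly rate implied by the quoted cost (an inference, not a stated figure).
implied_hourly_rate = QUOTED_COST_USD / hours_per_hire

# Months of Purview spend covered by one onboarding cycle, at the high and low end of the cost range.
months_covered = tuple(
    round(QUOTED_COST_USD / cost, 2) for cost in reversed(PURVIEW_MONTHLY_USD)
)
print(hours_per_hire, implied_hourly_rate, months_covered)
```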

📩 Running Purview in your org? Or struggling with a data catalog rollout that didn't stick? Drop your experience in the comments. The most useful part of these posts is always what the comments surface that the post missed.

PySpark Optimization Techniques: 6 Mistakes That Slow Down Every Beginner's Pipeline

rathoreadityasingh30@gmail.com (Aditya Singh Rathore) — Sat, 16 May 2026 00:00:00 GMT

The job ran for four hours. It processed 8GB of data.

A file copy of that same 8GB on the same machine would have taken about 45 seconds.

That gap between what Spark could do and what it actually did was entirely self-inflicted. Not because the logic was wrong. The output was correct. But six decisions that seemed harmless at the time were quietly multiplying the runtime: a Python UDF where a built-in function existed, a join that shuffled 200 million rows when it didn't have to, a read that scanned 90 days of data to find yesterday's records.

This post is about those six decisions. Each one is a pattern that beginners hit constantly, not because they're careless, but because PySpark doesn't stop you. It runs the slow version just as willingly as the fast one. You only find out at 3am when the SLA is missed.

What you'll learn in this post:

Why too many shuffle partitions is just as bad as too few and how to pick the right number
How caching works under the hood, and when it actively hurts performance
The three join strategies Spark supports and exactly when to use each one
Why Python UDFs are a performance trap and what to use instead
How predicate pushdown and column pruning reduce data read before any Spark code runs
How to size executors so you stop leaving half your cluster idle

The Pipeline We'll Optimize

Every example in this post uses a single, realistic pipeline so you can see how the fixes interact with each other:

Raw sales events (JSON, S3/ADLS)
    ↓
Bronze Delta table (~200M rows, partitioned by event_date)
    ↓
PySpark transformation job
    ↓
Silver Delta table (deduplicated, enriched, typed)
    ↓
Reporting aggregation
    ↓
Gold Delta table (daily summaries)

Before any optimization: 4 hours 12 minutes end-to-end on a 4-node cluster.
After all six fixes: 34 minutes on the same cluster.

Let's go through every mistake.

Mistake #1: Wrong Number of Shuffle Partitions

Time lost to this mistake: ~55 minutes

Every time Spark needs to reorganize data across the network after a groupBy, a join, or a distinct, it performs a shuffle. The shuffled data is split into partitions, and the number of those partitions is controlled by one setting:

spark.conf.get("spark.sql.shuffle.partitions")
# Default: "200"

The default is 200. That number made sense for the era of multi-TB Hadoop clusters it was designed for. On our 8GB pipeline, it created a different problem entirely: 200 tasks launched, each assigned roughly 40MB of data, spending more time on Spark's task scheduling machinery than on the actual groupBy computation. The cluster looked busy. The progress bar moved. But most of what was happening was overhead, not work.

The inverse bites you just as hard in the other direction. Too few partitions on a genuinely large dataset means each task takes on more data than it can hold in memory and starts spilling to disk — and disk spills are catastrophically slow compared to in-memory operations.

Understanding Partitions First

Think of shuffle partitions like checkout lanes at a supermarket. If you open 200 lanes for 64 customers, each busy cashier handles one customer and then sits idle, and you're paying for 136 empty lanes. Open 4 lanes for 200 customers and you get a queue that never moves. The goal is matching lane count to the actual number of customers, not picking a number that sounds safe.

In Spark terms: target 128MB to 200MB of data per partition after a shuffle.

Ideal partitions = Total data size after shuffle ÷ 128MB

For our 8GB transformation job (data after the join/groupBy, not raw input):

Ideal partitions = 8,000MB ÷ 128MB ≈ 63

We round up to a clean number, 64, and set it before any transformation runs.
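
As a rough sketch of that rule of thumb, the helper below turns a post-shuffle size estimate into a partition count. The function names are ours, and rounding up to a multiple of the cluster's total core count is an assumption (16 cores is a hypothetical figure, not a detail of the pipeline above); it happens to land on the same 64.

```python
import math

def ideal_shuffle_partitions(shuffle_size_mb: float, target_mb: float = 128) -> int:
    """Partitions needed so that each holds at most target_mb of shuffled data."""
    return math.ceil(shuffle_size_mb / target_mb)

def round_up_to_multiple(partitions: int, multiple: int) -> int:
    """Round up so the partition count divides evenly across the available cores."""
    return math.ceil(partitions / multiple) * multiple

raw = ideal_shuffle_partitions(8_000)            # 8,000MB / 128MB = 62.5, rounded up
tuned = round_up_to_multiple(raw, multiple=16)   # 16 total cores is an assumed value
print(raw, tuned)
```

The result would then be passed to spark.conf.set("spark.sql.shuffle.partitions", str(tuned)) before the first transformation.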

transformation-before.py
# Default: 200 shuffle partitions
# On 8GB of data, each partition is ~40MB
# 200 tasks scheduled, most doing trivial work
# Task launch overhead dominates actual compute time

from pyspark.sql import functions as F  # F.sum, not Python's built-in sum

df = (
    spark.read.format("delta")
    .load("abfss://data@lake.dfs.core.windows.net/bronze/sales/")
    .groupBy("product_id", "event_date")
    .agg(F.sum("revenue").alias("total_revenue"))
)

transformation-after.py
# Set BEFORE any transformations run
# Rule of thumb: Total shuffle data size ÷ 128MB
# For ~8GB post-shuffle data: 8000 ÷ 128 ≈ 63, rounded up to 64
from pyspark.sql import functions as F  # F.sum, not Python's built-in sum

spark.conf.set("spark.sql.shuffle.partitions", "64")

df = (
    spark.read.format("delta")
    .load("abfss://data@lake.dfs.core.windows.net/bronze/sales/")
    .groupBy("product_id", "event_date")
    .agg(F.sum("revenue").alias("total_revenue"))
)

Setting	Shuffle Partitions	Stage Duration	Tasks Launched
Default (200)	200	48 min	200
Tuned (64)	64	11 min	64

Result: Aggregation stage dropped from 48 minutes to 11 minutes. ~37 minutes saved.

tip

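On Spark 3.x (Databricks included), you can also enable Adaptive Query Execution (AQE), which automatically adjusts shuffle partitions at runtime based on actual data size. It is on by default from Spark 3.2 onward, so on recent versions these settings are usually already in effect: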

spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

AQE doesn't replace manual tuning but it acts as a safety net when your estimate is off. We run both: manual tuning as the primary setting, AQE as the fallback.

Mistake #2: Caching Everything (Or Nothing)

Time lost to this mistake: ~28 minutes

Caching is one of the most misunderstood features in PySpark. Beginners either avoid it entirely (paying to recompute the same DataFrame multiple times) or cache everything (consuming all available memory and forcing everything else to spill to disk).

What Caching Actually Does

Calling .cache() on a DataFrame doesn't immediately store anything. Spark is lazy, so nothing happens until an action triggers computation. What .cache() actually does is plant a flag that says: the first time you compute this, hold onto the result. The next time something references this DataFrame, Spark reads from that stored result instead of re-running the entire computation from scratch.

The reason this matters is that Spark has no implicit memory of previous computations. Without caching, every action that references base_df starts from the beginning, re-reading the source files, re-running the joins, re-applying the filters. We discovered this the painful way when a pipeline that looked like one job was actually running the most expensive stage twice, adding 28 minutes to every run.

This only helps if you reference the same DataFrame more than once. If you compute a DataFrame, transform it once, and write it, caching adds overhead with zero benefit.

# This caching is useless — df is only used once
df = spark.read.format("delta").load(silver_path).cache()
df.write.format("delta").save(gold_path)

The right time to cache is when a DataFrame is expensive to compute and you reference it in multiple downstream operations.

pipeline-no-cache.py
# base_df is computed TWICE, once for each write
# Spark re-reads and re-joins from scratch each time
from pyspark.sql import functions as F  # F.col and F.sum (not Python's built-in sum)

base_df = (
    spark.read.format("delta").load(bronze_path)
    .join(products_df, "product_id", "left")
    .filter(F.col("event_date") == yesterday)
)

# First action: triggers full computation of base_df
base_df.write.format("delta").mode("append").save(silver_path)

# Second action: triggers FULL recomputation of base_df again
base_df.groupBy("category").agg(F.sum("revenue")).write.format("delta").save(gold_path)

pipeline-with-cache.py
from pyspark import StorageLevel
from pyspark.sql import functions as F  # F.col and F.sum (not Python's built-in sum)

base_df = (
    spark.read.format("delta").load(bronze_path)
    .join(products_df, "product_id", "left")
    .filter(F.col("event_date") == yesterday)
)

# Cache BEFORE the first action: base_df is used twice
# MEMORY_AND_DISK: spills to disk if memory is full (safer than MEMORY_ONLY)
base_df.persist(StorageLevel.MEMORY_AND_DISK)

# First use: computes and stores base_df
base_df.write.format("delta").mode("append").save(silver_path)

# Second use: reads from cache, no recomputation
base_df.groupBy("category").agg(F.sum("revenue")).write.format("delta").save(gold_path)

# Always unpersist when done: frees executor memory for the next stage
base_df.unpersist()

note

We always use MEMORY_AND_DISK rather than MEMORY_ONLY. The reason: when memory fills up, MEMORY_ONLY silently drops the cached partitions that do not fit and recomputes them on demand, so you get none of the benefit and all of the overhead. We got burned by this once when a larger-than-usual dataset caused silent eviction mid-pipeline. MEMORY_AND_DISK spills the overflow to disk instead of evicting, which is slower than memory but far better than recomputing from scratch. (DataFrame.cache() already defaults to a memory-and-disk storage level; calling persist() explicitly just makes the choice visible.)

Result: Eliminated one full recomputation of the join + filter stage. ~28 minutes saved.

Mistake #3: Using the Wrong Join Strategy

Time lost to this mistake: ~62 minutes

Joins are among the most expensive operations in distributed computing. When two datasets need to be joined, Spark has to get rows with matching keys onto the same machine, which usually means moving large amounts of data across the network. That network movement is called a shuffle, and it's where most of the time in a join stage actually goes.

PySpark supports three join strategies. Understanding which one to use and when is one of the highest-leverage optimizations available.

The Three Strategies

Sort-Merge Join (default for large tables): Both datasets are shuffled so matching keys land on the same partition, then sorted, then merged. Correct for any size. Expensive because of the full shuffle.

Broadcast Join (best for large + small table): The smaller table is collected to the driver and sent as a complete copy to every executor. The large table never moves. Dramatically faster when the small table fits comfortably in memory.

Bucket Join (best for repeated joins on the same key): Both tables are pre-arranged on disk by join key at write time. When you join two bucketed tables on their bucket key, Spark skips the shuffle entirely because the data is already sitting where it needs to be. Expensive upfront, free on every subsequent join.

join-before.py
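# Spark falls back to Sort-Merge Join when it cannot tell that one side is small
# products_df has 50,000 rows (tiny)
# But unless its size estimate is below spark.sql.autoBroadcastJoinThreshold (default 10MB),
# Spark shuffles BOTH tables: 200M rows of sales_df cross the network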

sales_df = spark.read.format("delta").load(bronze_path)
products_df = spark.read.format("delta").load(products_path)

enriched_df = sales_df.join(products_df, "product_id", "left")

join-after.py
from pyspark.sql.functions import broadcast

sales_df = spark.read.format("delta").load(bronze_path)
products_df = spark.read.format("delta").load(products_path)

# Hint tells Spark to broadcast products_df to every executor
# sales_df (200M rows) is NEVER shuffled
# products_df (50K rows) is collected once and sent to all nodes
enriched_df = sales_df.join(broadcast(products_df), "product_id", "left")

bucket-join.py
# Write tables once with bucketing — expensive upfront, free on every future join
# Use when the same large-to-large join runs repeatedly

sales_df.write \
    .bucketBy(64, "product_id") \
    .sortBy("product_id") \
    .format("parquet") \
    .saveAsTable("sales_bucketed")

events_df.write \
    .bucketBy(64, "product_id") \
    .sortBy("product_id") \
    .format("parquet") \
    .saveAsTable("events_bucketed")

# Now this join has ZERO shuffle — data is already co-located
result = spark.table("sales_bucketed").join(
    spark.table("events_bucketed"), "product_id"
)

Scenario	Strategy	Why
Large table + small table (< 200MB)	Broadcast join	Eliminates shuffle of large table
Large table + large table, one-time	Sort-merge (default)	No alternative without pre-partitioning
Large table + large table, repeated	Bucket join	Pre-pays shuffle cost once, eliminates it forever
Skewed keys (a few keys have millions of rows)	Salting + broadcast	See tip below
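
Read as code, the guide is a short chain of checks. This is only a sketch of the table's logic: the function name, argument names, and the order in which the checks run are our assumptions, and the 200MB limit is the threshold from the first row.

```python
def choose_join_strategy(
    smaller_table_mb: float,
    repeated: bool,
    skewed: bool,
    broadcast_limit_mb: float = 200,
) -> str:
    """Map a join scenario onto the strategy named in the guide."""
    if skewed:
        return "salting + broadcast"   # a few keys hold most of the rows
    if smaller_table_mb < broadcast_limit_mb:
        return "broadcast"             # small side fits in executor memory
    if repeated:
        return "bucket"                # pay the shuffle once at write time
    return "sort-merge"                # one-off large-to-large join

# Hypothetical sizes: a 5MB dimension table and a 5GB fact-to-fact join.
print(choose_join_strategy(5, repeated=False, skewed=False))
print(choose_join_strategy(5_000, repeated=True, skewed=False))
```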

tip

Join skew is a related problem: when a small number of keys have a disproportionate number of rows, all that data lands on one executor, which becomes a bottleneck while the rest of the cluster sits idle. The fix is salting: add a random integer (0 to N−1) to the skewed key, replicate the smaller table N times with matching salt values, join on the salted key, then drop the salt column. This spreads the skewed key across up to N partitions.
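
The mechanics are easier to follow on a toy example. The sketch below models salting in plain Python on lists of dicts, not in Spark: it salts a skewed sales list, replicates a small products list once per salt value, joins on the salted key, and drops the salt. All names and the sample data are hypothetical.

```python
import random
from collections import Counter

N_SALTS = 4  # number of salt values; picked arbitrarily for the example

def add_salt(rows, n_salts, rng):
    """Large side: tag each row with a random salt in [0, n_salts)."""
    return [{**row, "salt": rng.randrange(n_salts)} for row in rows]

def replicate(rows, n_salts):
    """Small side: one copy of each row per salt value, so every salted key finds a match."""
    return [{**row, "salt": salt} for row in rows for salt in range(n_salts)]

def inner_join(left, right, keys):
    """Hash join on the given key columns (right side must be unique per key)."""
    index = {tuple(row[k] for k in keys): row for row in right}
    joined = []
    for row in left:
        match = index.get(tuple(row[k] for k in keys))
        if match is not None:
            joined.append({**match, **row})
    return joined

# 1,000 sales for one hot product plus a handful for another: a skewed key.
sales = [{"sale_id": i, "product_id": "p1"} for i in range(1000)]
sales += [{"sale_id": 1000 + i, "product_id": "p2"} for i in range(5)]
products = [
    {"product_id": "p1", "name": "widget"},
    {"product_id": "p2", "name": "gadget"},
]

rng = random.Random(42)
salted_sales = add_salt(sales, N_SALTS, rng)
salted_products = replicate(products, N_SALTS)

joined = inner_join(salted_sales, salted_products, keys=["product_id", "salt"])
result = [{k: v for k, v in row.items() if k != "salt"} for row in joined]

# Rows per join key before and after salting: the hot key is split into several groups.
before = Counter(row["product_id"] for row in sales)
after = Counter((row["product_id"], row["salt"]) for row in salted_sales)
print(before.most_common(1), len(after))
```

The join result is identical to the unsalted join. Only the grouping of rows by join key changes, which is what lets Spark spread the hot key over several tasks.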

Result: Switching the dimension join from sort-merge to broadcast eliminated the largest shuffle in the pipeline. ~62 minutes saved.

Mistake #4: Writing Python UDFs Instead of Using Built-in Functions

Time lost to this mistake: ~38 minutes

Python UDFs (User Defined Functions) feel like a natural escape hatch. The built-in Spark functions don't cover what you need, so you write a Python function, decorate it with @udf, and move on. It works. It's just slow in a way that isn't immediately obvious, and on a 200-million-row dataset, "not immediately obvious" can mean 38 extra minutes per run.

Why UDFs Are Expensive

Here's what's actually happening when a Python UDF runs on a Spark cluster: Spark's execution engine runs on the JVM, and your Python function runs in a completely separate worker process. Every single row your UDF touches has to be serialized, handed across that process boundary into the Python runtime, processed, and then serialized again and handed back to the JVM. It's the equivalent of passing every item from a warehouse to a worker standing outside the building through a narrow window, one item at a time, both ways.

We had three UDFs doing string cleaning on a 200-million-row DataFrame. Each UDF triggered that full cross-process handoff 200 million times. The functions themselves were trivial: a regex and some string lowercasing. The cost wasn't in the logic. It was in the 600 million window-handoffs happening around it.

Built-in Spark functions (pyspark.sql.functions) don't have this problem. They run entirely inside the JVM alongside Spark's own engine, with no process boundary to cross and no per-row packaging overhead.

udf-before.py
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType
import re

# Registered as a Python UDF
# For 200M rows: cross-process handoff happens 200M times per UDF
@udf(returnType=StringType())
def clean_phone(phone):
    if phone is None:
        return None
    digits = re.sub(r"\D", "", phone)
    return digits if len(digits) == 10 else None

@udf(returnType=StringType())
def normalize_category(cat):
    if cat is None:
        return "unknown"
    return cat.strip().lower().replace(" ", "_")

df = (
    df.withColumn("phone_clean", clean_phone(col("phone")))
      .withColumn("category_norm", normalize_category(col("category")))
)

udf-after.py
from pyspark.sql.functions import (
    regexp_replace, when, length, trim, lower, col
)

# All native JVM execution — no cross-process overhead at all
df = (
    df
    # Strip non-digits from phone
    .withColumn("phone_digits", regexp_replace(col("phone"), r"\D", ""))
    # Keep only 10-digit numbers, null otherwise
    .withColumn(
        "phone_clean",
        when(length(col("phone_digits")) == 10, col("phone_digits")).otherwise(None)
    )
    # Normalize category: trim, lowercase, replace spaces
    .withColumn(
        "category_norm",
        when(col("category").isNull(), "unknown")
        .otherwise(
            regexp_replace(lower(trim(col("category"))), " ", "_")
        )
    )
    .drop("phone_digits")
)

pandas-udf.py
# If no built-in equivalent exists, use a Pandas UDF (vectorized)
# Pandas UDFs process data in Arrow batches, not row-by-row
# Still crosses the process boundary, but once per batch instead of once per row

from pyspark.sql.functions import col, pandas_udf
from pyspark.sql.types import StringType
import pandas as pd

@pandas_udf(StringType())
def complex_transform(series: pd.Series) -> pd.Series:
    # This runs on batches of rows, not individual rows
    # Use only when no built-in function covers your logic
    # your_complex_logic is a placeholder for your own Python function
    return series.apply(lambda x: your_complex_logic(x) if x else None)

df = df.withColumn("result", complex_transform(col("input_col")))
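
To put rough numbers on that difference, the sketch below counts Python function invocations for the three UDFs described earlier. It assumes Spark's default Arrow batch size of 10,000 rows (spark.sql.execution.arrow.maxRecordsPerBatch); real batch counts will be somewhat higher because batches do not span partitions.

```python
import math

ROWS = 200_000_000
UDF_COUNT = 3
ARROW_BATCH_ROWS = 10_000  # default spark.sql.execution.arrow.maxRecordsPerBatch

# A plain Python UDF is invoked once per row.
row_calls = ROWS * UDF_COUNT

# A Pandas UDF is invoked once per Arrow batch (lower bound: ignores partition boundaries).
batch_calls = math.ceil(ROWS / ARROW_BATCH_ROWS) * UDF_COUNT

print(f"{row_calls:,} row-level calls vs {batch_calls:,} batch-level calls")
```

Built-in functions remain the better choice where they exist, since they skip the Python process entirely.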

note

The decision tree we follow for function choice:

Does a pyspark.sql.functions built-in exist? → Use it.
Does the logic involve complex Python libraries (ML models, third-party parsing libraries, etc.)? → Use a Pandas UDF.
Is there truly no alternative? → Use a Python UDF, and leave a comment explaining why.

The vast majority of string cleaning, type casting, null handling, and conditional logic is covered by built-in functions. Check the PySpark function docs before reaching for @udf — it takes 5 minutes and has saved us hours.

Result: Replaced three Python UDFs with built-in equivalents. Stage runtime dropped from 41 minutes to 3 minutes. ~38 minutes saved.

Mistake #5: Reading More Data Than Necessary

Time lost to this mistake: ~44 minutes

Before any transformation runs, the data has to come off storage and into Spark's memory. If you pull 180GB when you only need 2GB, you've already lost: no amount of smart transformation logic downstream recovers those wasted read operations.

Two mechanisms cut data at the source: predicate pushdown and column pruning. Both work with Parquet and Delta Lake natively. Both get silently deactivated by small, easy-to-miss coding patterns, such as a Python UDF inside the filter or a cache() placed before it.

Predicate Pushdown

Imagine your Delta table as a library where each day's data lives in its own room, with the date on the door. Partition pruning is walking straight to yesterday's room. What we were doing instead was opening every room in the library, pulling every book off every shelf, carrying it all to a reading table, and then only reading the ones with yesterday's date on the spine before putting everything else back. The library was organized correctly. We just weren't reading the signs on the doors.

With 90 days of history accumulated, we were reading 90x more data than the job actually needed on every single run. The fix is pushing the date filter into the read itself, so Spark can use the partition directory structure to skip everything irrelevant before a single file is opened.

Column Pruning

Parquet stores data column by column, not row by row. This means if your table has 40 columns but your transformation uses 6, you can tell Spark to only load those 6 columns' physical data from disk. The other 34 are never touched. The catch: you have to select those columns at read time, not after a chain of transformations.

Before (Full Scan)
After (Pushdown + Pruning)
Verify It's Working

read-before.py
from datetime import datetime, timedelta
from pyspark.sql.functions import col

yesterday = (datetime.now() - timedelta(days=1)).strftime("%Y-%m-%d")

# Reads ALL columns from ALL partitions
# Then filters in memory — after all 180GB is already read
bronze_df = spark.read.format("delta").load(bronze_path)

# Filter applied AFTER load — no partition pruning, no column pruning
filtered_df = (
    bronze_df
    .filter(col("event_date") == yesterday)
    .filter(col("status") == "completed")
)

# Selecting columns here is too late — data already read into memory
result_df = filtered_df.select("event_id", "product_id", "revenue", "event_date")

read-after.py
from datetime import datetime, timedelta
from pyspark.sql.functions import col

yesterday = (datetime.now() - timedelta(days=1)).strftime("%Y-%m-%d")

# Column pruning: Spark reads ONLY these columns from Parquet files
# Predicate pushdown: partition filter applied at file-reader level
# Spark skips all partitions where event_date != yesterday
bronze_df = (
    spark.read.format("delta")
    .load(bronze_path)
    .select("event_id", "product_id", "revenue", "event_date", "status")  # Prune columns first
    .filter(col("event_date") == yesterday)      # Partition pruning — activates at read time
    .filter(col("status") == "completed")        # Predicate pushdown into Parquet row groups
)

verify-pushdown.py
# Confirm predicate pushdown is active — check the physical plan
bronze_df.explain(mode="extended")

# In the output, look for:
# PartitionFilters: [isnotnull(event_date#12), (event_date#12 = 2026-05-15)]
# PushedFilters: [IsNotNull(status), EqualTo(status,completed)]
#
# If you see PartitionFilters: []  →  partition pruning is NOT active
# If you see PushedFilters: []     →  predicate pushdown is NOT active
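
The same check can be scripted rather than eyeballed. The sketch below is a hypothetical helper (pushdown_status is not part of Spark) and assumes you have already captured the plan text as a string. It only looks for the two bracketed lists named above, so treat it as a rough guard rather than a full plan parser.

```python
import re

def pushdown_status(plan_text: str) -> dict:
    """Map each filter type to True when its bracketed list is non-empty."""
    status = {}
    for key in ("PartitionFilters", "PushedFilters"):
        # Grab the text between the square brackets that follow the key
        match = re.search(rf"{key}: \[(.*?)\]", plan_text)
        status[key] = bool(match and match.group(1).strip())
    return status

# Sample text modeled on the plan lines shown above
sample_plan = (
    "PartitionFilters: [isnotnull(event_date#12), (event_date#12 = 2026-05-15)], "
    "PushedFilters: [IsNotNull(status), EqualTo(status,completed)]"
)
print(pushdown_status(sample_plan))
```

A pipeline test could fail fast when either value comes back False, instead of waiting for someone to notice a slow run.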

note

Two things both have to be true for partition pruning to work. First, the table must have been written with partitionBy on the column you're filtering. Second, and this is the one that catches people, the filter must be on the partition column as it exists in the table, not on a renamed or derived version. We once spent an hour debugging a full scan that turned out to be caused by a withColumnRenamed("event_date", "date") sitting one line before the filter. The column name changed, Spark couldn't match it to the partition metadata, and pruning silently fell back to a full scan.

Result: Data read dropped from ~180GB to ~2GB. Read + deserialization time fell from 47 minutes to 3 minutes. ~44 minutes saved.

Mistake #6: Default Cluster Configuration

Time lost to this mistake: ~35 minutes (idle and wasted compute)

Even with perfect code, a misconfigured cluster leaves compute sitting idle. These settings control how many tasks run in parallel, how much memory each task gets, and whether the cluster actually uses all the hardware you're paying for.

Beginners typically either accept the cloud provider's defaults without question, or paste settings from a Stack Overflow answer written for a different dataset and cluster size. Neither approach reflects the actual workload.

The Key Settings and What They Do

spark.executor.memory - how much RAM each executor process gets. Too little and tasks start writing intermediate data to disk, which is dramatically slower. Too much and you've allocated headroom the executor can't use, while also giving the JVM garbage collector more memory to scan on every GC cycle.

spark.executor.cores - how many tasks an executor runs simultaneously. We settled on 5 after testing: below 4, the executor's memory sits underutilized because there aren't enough concurrent tasks to fill it. Above 5, we started seeing storage I/O contention — too many tasks competing to read from the same disks at once. Five was the sweet spot for our setup, and it matches what we've seen hold up across different cluster sizes.

spark.executor.instances - total number of executors. With autoscale on, this becomes a min/max bound rather than a fixed count.

spark.driver.memory - the driver collects broadcast tables before distributing them to executors, so it needs more headroom than the default 1g allows. We had broadcast joins failing silently and falling back to sort-merge before we realized the driver was OOM-ing on the collection step.

Right-Sizing for Our 8GB Pipeline

Our cluster: 4 worker nodes, each with 16 cores and 64GB RAM.

Available per node after OS overhead (~7GB): 57GB RAM, 15 cores
Executor cores: 5 (our tested sweet spot)
Executors per node: 15 ÷ 5 = 3 executors per node
Memory per executor: 57GB ÷ 3 = 19GB (leave ~1GB headroom → set 18GB)
Total executors: 3 × 4 nodes = 12 executors
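
The arithmetic above can be wrapped in a small helper so it is easy to redo for a different cluster. This is a sketch, not a Spark API: size_executors and its parameter names are made up for illustration, and the defaults (1 core and 7GB reserved per node, 5 cores per executor, 1GB headroom) simply mirror the numbers in this section.

```python
def size_executors(nodes, cores_per_node, ram_gb_per_node,
                   os_cores=1, os_ram_gb=7, cores_per_executor=5, headroom_gb=1):
    """Derive executor count and memory from node hardware."""
    usable_cores = cores_per_node - os_cores          # 16 - 1 = 15
    usable_ram_gb = ram_gb_per_node - os_ram_gb       # 64 - 7 = 57
    executors_per_node = usable_cores // cores_per_executor
    memory_gb = usable_ram_gb // executors_per_node - headroom_gb
    return {
        "executors_per_node": executors_per_node,
        "executor_memory_gb": memory_gb,
        "total_executors": executors_per_node * nodes,
    }

plan = size_executors(nodes=4, cores_per_node=16, ram_gb_per_node=64)
print(plan)
```

For the 4-node cluster described here, this reproduces 3 executors per node, 18GB per executor, and 12 executors in total.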

Before (Defaults)
After (Right-Sized)
Config Reference

cluster-default.py
# Default Spark config — unchanged from cluster creation
# On our 4-node cluster, these settings leave most resources unused

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("SalesPipeline") \
    .getOrCreate()

# Defaults that hurt us:
# spark.executor.memory    = 1g   (way too small — spills to disk constantly)
# spark.executor.cores     = 1    (only 1 task per executor — 15 cores idle per node)
# spark.executor.instances = 2    (2 executors on a 4-node cluster — 50% idle)
# spark.driver.memory      = 1g   (broadcast joins silently fall back to sort-merge)

cluster-tuned.py
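from pyspark.sql import SparkSession

# Note: in client mode, spark.driver.memory only takes effect if it is set before
# the driver JVM starts (spark-submit --driver-memory or spark-defaults.conf)
spark = SparkSession.builder \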
    .appName("SalesPipeline") \
    .config("spark.executor.memory", "18g") \
    .config("spark.executor.cores", "5") \
    .config("spark.executor.instances", "12") \
    .config("spark.driver.memory", "8g") \
    .config("spark.sql.shuffle.partitions", "64") \
    .config("spark.sql.adaptive.enabled", "true") \
    .config("spark.dynamicAllocation.enabled", "true") \
    .config("spark.dynamicAllocation.minExecutors", "2") \
    .config("spark.dynamicAllocation.maxExecutors", "12") \
    .getOrCreate()

Setting	Default	Our Value	Rule of Thumb
`spark.executor.memory`	1g	18g	(Node RAM − OS overhead) ÷ executors per node
`spark.executor.cores`	1	5	4–5 per executor (test on your setup)
`spark.executor.instances`	2	12	(cores per node ÷ executor cores) × node count
`spark.driver.memory`	1g	8g	4–8g; higher if using large broadcasts
`spark.sql.shuffle.partitions`	200	64	Total shuffle data size ÷ 128MB

tip

For cloud clusters (Databricks, EMR, Dataproc), enable dynamic allocation instead of a fixed executor count. Dynamic allocation releases executors back to the pool during idle stages and acquires more when tasks are queuing — so a 3-minute light stage doesn't hold 12 executors that other jobs could use.

# Launch-time settings: apply them when the session or cluster is created
spark = (SparkSession.builder
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "2")
    .config("spark.dynamicAllocation.maxExecutors", "12")
    .getOrCreate())

Result: Fully utilizing all 4 nodes reduced total wall-clock time by eliminating idle compute. Combined with eliminating disk spills from under-provisioned executors: ~35 minutes saved.

Before and After Summary

info

Mistake	Root Cause	Time Before	Time After	Saved
Wrong shuffle partition count	Default 200 partitions for 8GB dataset	48 min	11 min	37 min
No caching on reused DataFrame	base_df computed twice from scratch	28 min	<1 min	28 min
Sort-merge join on dimension table	50K-row table shuffled like a large table	65 min	3 min	62 min
Python UDFs for string operations	Per-row cross-process overhead	41 min	3 min	38 min
Full table scan on partitioned table	Filter applied after read, not at read time	47 min	3 min	44 min
Default cluster config (1 core/executor)	15 cores idle per node, constant disk spill	45 min	10 min	35 min
Total		4h 12min	34 min	~3h 38min

From 4 hours 12 minutes down to 34 minutes — an 86% reduction on a pipeline doing exactly the same computation on exactly the same data.

PySpark Optimization Checklist

Run through this before every pipeline goes to production.

Shuffle & Partitions

Is spark.sql.shuffle.partitions set based on actual post-shuffle data size, not the default 200?
Is Adaptive Query Execution (spark.sql.adaptive.enabled) turned on?
Are there stages with a very large or very small number of tasks compared to the cluster size?

Caching

Is any DataFrame referenced more than once? If yes — is it cached before the first action?
Is .unpersist() called after the cached DataFrame is no longer needed?
Is StorageLevel.MEMORY_AND_DISK used instead of MEMORY_ONLY?

Joins

Is every join between a large and small table using broadcast()?
Is any large-to-large join repeated on the same key? If yes — is bucketing being used?
Are there any skewed keys? Check the Spark UI for tasks with 10x–100x longer runtimes than others in the same stage.

Functions & UDFs

Is every Python UDF replaceable with a pyspark.sql.functions built-in?
If a UDF is unavoidable, is it a Pandas UDF (vectorized) rather than a row-by-row Python UDF?

Reading Data

Are only needed columns selected at read time (not select * after transformation)?
Is the partition filter applied immediately on the read result, on the partition column itself?
Does df.explain() show PartitionFilters and PushedFilters as non-empty?

Cluster Configuration

Is spark.executor.cores set to 4–5 (not the default of 1)?
Is spark.executor.memory calculated from actual node RAM, not left at the 1g default?
Is dynamic allocation enabled for variable-length workloads?
Is spark.driver.memory set high enough to handle broadcast tables without OOM?

Key Lessons

The Spark UI is your fastest debugging tool. Every mistake above shows up in the Spark UI before you ever look at the code: long stage durations from wrong partition counts, skewed task distribution from join issues, tiny data sizes per task from over-partitioning, zero partition filters from missed pushdown. Open the UI first, read the physical plan second, look at the code third.

PySpark never stops you from writing the slow version. The job runs either way. The only difference is whether it finishes in 34 minutes or 4 hours. Spark assumes you know what you're doing — which means the performance consequences of defaults are entirely invisible until you look for them.

Built-in functions exist for almost everything. The instinct to reach for a Python UDF is understandable — Python is what most data engineers know best. But the pyspark.sql.functions module covers an enormous surface area: string manipulation, date arithmetic, array operations, conditional logic, window functions. A 5-minute search through the docs is almost always faster than the performance penalty of writing and maintaining a UDF.

Optimization compounds. None of the six fixes above is independent. Fixing the partition count makes the join faster. Fixing the join makes caching more effective. Fixing the read makes everything downstream cheaper. Start with the fix that addresses the largest stage duration in the Spark UI and work down from there.

Frequently Asked Questions

Q: How do I know if my DataFrame is actually being cached or if Spark is silently dropping it? A: The Storage tab in the Spark UI is the fastest way to check. Cached DataFrames show up there with their storage level, what fraction of the data was actually stored, and how much memory it consumed. If nothing shows up after an action runs, it either means the cache hasn't been triggered yet — caching is lazy, so it only materializes on the first action — or Spark evicted it because executor memory filled up. We switched everything to MEMORY_AND_DISK after getting burned by a silent eviction that caused a job to recompute a 20-minute stage we thought was cached. Under that storage level, Spark spills to disk instead of evicting, so you at least keep the result.

Q: Is the broadcast join threshold configurable? What if my "small" table is 300MB? A: Yes, Spark's default auto-broadcast threshold is 10MB, which is conservative. We've raised it to 300MB on tables we know are stable in size:

spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(300 * 1024 * 1024))

One thing we learned: don't go above 500MB without testing carefully. The driver has to collect the entire table into memory before broadcasting it out, and if you push that too high you'll see the driver OOM before the broadcast even starts, and the error message isn't always obvious about what caused it.

Q: Should I always use spark.sql.adaptive.enabled? Are there downsides? A: We run it on everything now and haven't regretted it. AQE has genuinely saved us from bad shuffle partition counts more than once, particularly on pipelines where the data volume varies day to day and our static estimate was off. The one scenario where we saw it cause slowdowns was on a particularly complex query plan with 20+ joins, where AQE's planning overhead added more time than the optimization saved. We turned it off for that specific job and kept it on everywhere else.

Q: How do I find join skew in the Spark UI? A: Go to the Stages tab and look for a stage where the Max task duration is dramatically higher than the Median. Anything above a 5x ratio is worth investigating. We had a stage once where the median task took 3 seconds and one task took 47 minutes. That's the classic skew signature: one executor holding a massive key while the rest of the cluster finishes and sits idle. Click into the stage, look at the task duration histogram, and if you see one bar far to the right while everything else clusters near zero, you've found it.
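
If you export task durations from the UI or the event log, the same 5x rule is easy to apply in code. The following is a minimal sketch: skew_ratio and is_skewed are hypothetical helper names, and the sample list is synthetic, built to resemble the 3-second median and 47-minute straggler described above.

```python
from statistics import median

def skew_ratio(task_seconds):
    """Ratio of the slowest task to the median task in a stage."""
    return max(task_seconds) / median(task_seconds)

def is_skewed(task_seconds, threshold=5.0):
    """Flag a stage whose max/median ratio exceeds the threshold."""
    return skew_ratio(task_seconds) > threshold

# Synthetic stage: 199 tasks at 3 seconds, one straggler at 47 minutes
durations = [3.0] * 199 + [47 * 60.0]
print(skew_ratio(durations), is_skewed(durations))
```

The helper assumes a non-empty list with a non-zero median. A stage made up entirely of sub-millisecond tasks needs a different check.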

Q: What's the difference between .cache() and .persist()? A: In practice, we always use .persist(StorageLevel.MEMORY_AND_DISK) explicitly and skip .cache() entirely. .cache() is just .persist() with a default storage level, and that default is easy to misremember: RDD.cache() uses MEMORY_ONLY, DataFrame.cache() uses MEMORY_AND_DISK, and the exact DataFrame default has shifted between Spark releases. Rather than remember which case does what, we just use the explicit form. It costs a little extra typing and removes all ambiguity.

Q: Can I over-partition? Is more shuffle partitions always safer? A: Yes, over-partitioning is a real problem and we've hit it. We had a pipeline where someone had set shuffle partitions to 1000 "to be safe" on a 4GB dataset. The Spark UI showed 1000 tasks completing in under 100ms each: the entire stage was task scheduling overhead, not computation. Spark's scheduler has to launch, track, and retire each task individually, and at 1000 tasks on 4GB of data, that bookkeeping cost more than the actual work. If you see a stage in the UI where every task completes in milliseconds, that's the sign you're over-partitioned. Drop the count by 4x and re-run.


About the Author

Aditya Singh Rathore is a Data Engineer focused on building modern, scalable data platforms on Azure and Databricks. He writes about data engineering, cloud architecture, and real-world pipelines on RecodeHive, turning hard-won production lessons into content anyone can apply.


Azure Data Pipeline Cost Optimization: How We Cut a $4,200 Bill by 73%

rathoreadityasingh30@gmail.com (Aditya Singh Rathore) — Fri, 15 May 2026 00:00:00 GMT

The Azure billing email arrived on the first of the month. $4,247.83.

Our pipeline processed roughly 2GB of sales data per day and served a Power BI dashboard to 30 users. There was no logical reason for a bill that size. Over the next three days, a line-by-line audit of Azure Cost Analysis revealed not one big mistake but six medium-sized ones, each silently running up costs in parallel, invisible until the invoice arrived.

This post is that investigation: every mistake explained, every fix documented, and the exact before-and-after numbers. If you're building data pipelines on Azure and haven't audited your costs recently, at least one of these is probably happening to you right now.

What you'll learn in this post:

Why a Dedicated SQL Pool running 24/7 is the single most expensive default mistake in Azure Synapse
How to replace nightly full loads with watermark-based incremental loads in Azure Data Factory
How to right-size Spark pools and configure auto-termination to stop paying for idle compute
How partition pruning on Delta Lake tables can reduce data scanned by over 90%
How ADLS Gen2 lifecycle policies passively save money on storage with zero ongoing effort
When a scheduled micro-batch replaces a 24/7 streaming pipeline without any business impact

The Pipeline Architecture

Before the mistakes make sense, here is the full pipeline:

Daily sales data from REST API (~2GB/day)
    ↓
ADF Pipeline (ingestion)
    ↓
ADLS Gen2 — bronze/ (raw Parquet files)
    ↓
Spark job (transformation)
    ↓
ADLS Gen2 — silver/ (Delta tables)
    ↓
Dedicated SQL Pool (serving layer)
    ↓
Power BI dashboard (30 users)

A standard Medallion Architecture, nothing exotic. 2GB of data per day. 30 users. It should have cost a few hundred dollars a month at most. It cost $4,247.83. Here is exactly why.

Mistake #1: Dedicated SQL Pool Running 24/7

Monthly cost of this mistake: ~$1,800

This was the single largest line item. A Dedicated SQL Pool at DW200c was provisioned to serve the Power BI dashboard and left running continuously, 24 hours a day, 7 days a week, because auto-pause had never been configured.

The thing that surprised me most when I first dug into the bill was this: the SQL Pool charged us the same rate at 3am on a Saturday as it did during peak usage on a Tuesday afternoon. I had assumed, naively, that there was some kind of idle detection built in. There isn't. When it's provisioned, you're paying - full stop, whether a single query runs or not. Our 30 users were active between 9am and 6pm on weekdays, 45 hours of actual usage per week. The pool was running for 168 hours per week. That's 123 hours of idle, fully-billed compute every single week, and it showed up on our invoice as a flat $1,800 charge with no breakdown by usage.

note

The SQL Pool doesn't throttle billing when it's idle: provisioned capacity is billed by the hour regardless of query activity. Pausing the pool is the only way to stop the DWU clock. Storage costs continue when paused, but the compute component, which is the large part, stops completely.

The Fix: Auto-Pause with Azure Automation Runbooks

The solution is two Azure Automation runbooks, one to pause the pool at the end of business hours, one to resume it in the morning. The runbooks use Managed Identity for authentication, which avoids hardcoding credentials.

pause-sql-pool.py
# Azure Automation Runbook — pause Synapse SQL Pool outside business hours
from azure.identity import ManagedIdentityCredential
from azure.mgmt.synapse import SynapseManagementClient

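# Placeholder values: replace with your own (for example, Automation variables)
subscription_id = "<subscription-id>"
resource_group_name = "<resource-group>"
workspace_name = "<synapse-workspace>"
sql_pool_name = "<sql-pool>"

credential = ManagedIdentityCredential()
client = SynapseManagementClient(credential, subscription_id)

# Pause at 7pm weekdays; begin_pause returns a poller for the long-running operation
poller = client.sql_pools.begin_pause(
    resource_group_name,
    workspace_name,
    sql_pool_name,
)
poller.result()  # Wait for completion so a failed pause shows up in the job log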

Schedule the pause runbook at 7pm weekdays and the resume runbook at 8am. Weekends stay paused unless a manual override is triggered through the Azure portal.
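
The schedule itself boils down to one predicate, which is handy if you would rather run a single runbook that decides whether to pause or resume. This is a sketch under assumptions: pool_should_run is a hypothetical name, and the timestamp passed in is assumed to already be in the business's local time zone.

```python
from datetime import datetime

def pool_should_run(now: datetime, start_hour: int = 8, stop_hour: int = 19) -> bool:
    """True on weekdays between the resume time (8am) and the pause time (7pm)."""
    is_weekday = now.weekday() < 5  # Monday is 0, Saturday is 5
    return is_weekday and start_hour <= now.hour < stop_hour

print(pool_should_run(datetime(2026, 5, 19, 10, 0)))  # a Tuesday morning
print(pool_should_run(datetime(2026, 5, 16, 10, 0)))  # a Saturday morning
```

A manual weekend override would sit outside this check, for example as a flag the runbook reads before acting.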

Result: Billed hours dropped from 720 to roughly 210 per month. SQL Pool cost fell from ~$1,800/month to ~$530/month, a saving of $1,270/month.
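
As a sanity check on those numbers, the idle time and the expected bill can be estimated with two lines of arithmetic. The helper names below are hypothetical, and the estimate assumes compute cost scales linearly with billed hours over a 720-hour month.

```python
HOURS_PER_WEEK = 7 * 24

def idle_hours_per_week(active_hours):
    """Hours per week the pool is provisioned but nobody is querying it."""
    return HOURS_PER_WEEK - active_hours

def prorated_cost(full_month_cost, billed_hours, hours_in_month=720):
    """Scale an always-on monthly cost down to the hours actually billed."""
    return full_month_cost * billed_hours / hours_in_month

print(idle_hours_per_week(45))     # weekday 9am to 6pm usage
print(prorated_cost(1800, 210))    # always-on cost, billed hours after the fix
```

The prorated figure comes out at $525, in line with the roughly $530 on the invoice.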

tip

If your workload is exploratory rather than dashboard-serving, consider whether Serverless SQL Pool is sufficient. Serverless pools bill per TB of data scanned rather than provisioned DWUs, which can be significantly cheaper for infrequent query patterns.

Mistake #2: Full Load Running Every Night Instead of Incremental

Monthly cost of this mistake: ~$620

The ADF pipeline was configured to pull all records from the source database on every nightly run, not just new or updated ones. On Day 1 it pulled 2GB. By Day 60 it was pulling 120GB, processing records that had already been processed 59 times before.

This pattern is extremely common and extremely expensive. Every night, the ADF pipeline read the entire historical dataset, the Spark transformation job processed all of it, and the results were written back to Delta. The billable compute scaled with the dataset size, not with the actual volume of new data.

The Fix: Watermark-Based Incremental Loading

A watermark stores the timestamp of the last successfully processed record. Every pipeline run reads only records newer than that timestamp, then updates the watermark on success.

The implementation in ADF uses two Lookup activities and a query parameterized by the watermark value:

Query
Output

get-watermark.sql
-- Step 1: ADF Lookup activity — retrieve the last watermark
SELECT last_processed_date
FROM pipeline_watermarks
WHERE pipeline_name = 'sales_ingestion'

incremental-source-query.sql
-- Step 2: Source query, parameterized by watermark from Step 1
-- The upper bound is the pipeline trigger time, fixed once per run
SELECT *
FROM orders
WHERE updated_at > '@{activity('GetWatermark').output.firstRow.last_processed_date}'
  AND updated_at <= '@{pipeline().TriggerTime}'

update-watermark.sql
-- Step 3: After successful pipeline run, advance the watermark
-- to the same upper bound the source query used
UPDATE pipeline_watermarks
SET last_processed_date = '@{pipeline().TriggerTime}'
WHERE pipeline_name = 'sales_ingestion'
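
To see the three steps work end to end without an ADF instance, here is a minimal sketch using Python's built-in sqlite3. It is an illustration of the pattern, not the ADF implementation: run_incremental, the order_id column, and the sample rows are assumptions made for the demo. One deliberate choice: a single run timestamp is used both as the read's upper bound and as the new watermark, so rows that arrive while the run is in progress are picked up next time rather than skipped.

```python
import sqlite3

def run_incremental(conn, pipeline, run_ts):
    """Read rows changed since the stored watermark, then advance it to run_ts."""
    (watermark,) = conn.execute(
        "SELECT last_processed_date FROM pipeline_watermarks WHERE pipeline_name = ?",
        (pipeline,),
    ).fetchone()
    rows = conn.execute(
        "SELECT order_id, updated_at FROM orders "
        "WHERE updated_at > ? AND updated_at <= ? ORDER BY order_id",
        (watermark, run_ts),
    ).fetchall()
    # Advance the watermark to the same upper bound the read used
    conn.execute(
        "UPDATE pipeline_watermarks SET last_processed_date = ? WHERE pipeline_name = ?",
        (run_ts, pipeline),
    )
    conn.commit()
    return rows

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, updated_at TEXT)")
conn.execute("CREATE TABLE pipeline_watermarks (pipeline_name TEXT, last_processed_date TEXT)")
conn.execute("INSERT INTO pipeline_watermarks VALUES ('sales_ingestion', '2026-05-01T00:00:00')")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "2026-04-30T10:00:00"), (2, "2026-05-01T09:00:00"), (3, "2026-05-01T18:00:00")],
)

first = run_incremental(conn, "sales_ingestion", "2026-05-02T00:00:00")
second = run_incremental(conn, "sales_ingestion", "2026-05-03T00:00:00")
print(first, second)
```

The first run returns only orders 2 and 3, which changed after the stored watermark. The second run returns nothing because no rows changed in between. The string comparison works here because ISO 8601 timestamps in a single format sort chronologically.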

Pipeline Run	Records Processed	Data Volume	ADF Cost
Before (full load, Day 60)	~4.8M rows	~120 GB	~$22/run
After (incremental)	~8,000 rows	~2 GB	~$0.40/run

Result: ADF activity runtime dropped by 94%. Spark compute for the transformation step fell proportionally. Combined saving: ~$585/month.

tip

The watermark pattern requires a reliable updated_at or created_at column in the source table. If your source does not have one, work with the source team to add it. The cost saving on the pipeline side will far outweigh the schema migration effort.

Mistake #3: Spark Cluster Over-Provisioned for the Actual Workload

Monthly cost of this mistake: ~$480

When setting up the Spark pool in Azure Synapse, the default node size DS3_v2 (4 cores, 14GB RAM) was selected with 5 nodes. The actual workload: transforming 2–5GB of Parquet files daily with deduplication, type casting, and a few joins.

Two problems compounded each other. First, the cluster was consuming roughly 10x the compute it actually needed for the data volume. Second, auto-termination was set to 60 minutes of idle time, meaning that after a 12-minute job the cluster sat idle and fully billed for up to another hour before shutting down.

The Fix: Right-Sizing, Autoscale, and Fast Termination

The fix has three components that work together:

synapse-spark-pool-config.json
{
  "nodeSize": "Small",
  "minNodeCount": 2,
  "maxNodeCount": 4,
  "autoscaleEnabled": true,
  "autoTerminationEnabled": true,
  "autoTerminationDelayInMinutes": 5
}

The third component is tuning the shuffle partition count inside the Spark notebook. The default of 200 partitions is calibrated for large clusters and large datasets. For 2–5GB of data on a small cluster, 200 partitions creates unnecessary overhead that extends job runtime and therefore billed compute time.

spark-notebook-config.py
# Place in the first cell of every Spark notebook
# Default is 200 partitions — designed for multi-TB workloads
# For 2–5 GB datasets, 8 is appropriate
spark.conf.set("spark.sql.shuffle.partitions", "8")

info

A good rule of thumb for spark.sql.shuffle.partitions: aim for roughly 128MB of data per partition. For a 2GB dataset, that's approximately 16 partitions. Err slightly lower rather than higher for small datasets on small clusters.
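
The rule of thumb is simple enough to compute rather than guess. A minimal sketch, assuming you know (or can estimate from the Spark UI) the size of the shuffled data in bytes. The function below is a hypothetical helper, not a Spark setting.

```python
import math

def shuffle_partitions(shuffle_bytes, target_bytes=128 * 1024**2):
    """Partition count that puts roughly target_bytes of data in each partition."""
    return max(1, math.ceil(shuffle_bytes / target_bytes))

GB = 1024**3
print(shuffle_partitions(2 * GB))
print(shuffle_partitions(5 * GB))
```

For 2GB this gives 16 and for 5GB it gives 40. The notebook setting of 8 above is deliberately on the low side of that range for a small cluster.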

Result: Spark compute cost dropped from ~$580/month to ~$100/month, a saving of $480/month.

Mistake #4: Reading ADLS Gen2 Files Without Partition Pruning

Monthly cost of this mistake: ~$290

The Silver layer Delta table was partitioned by order_date. The Spark transformation job, however, was reading the entire table and applying a date filter after the read, not during it.

Think of it like this: imagine your filing cabinet is organized by month, one drawer per month, clearly labelled. Partition pruning is pulling open only January's drawer. What we were doing instead was dumping every drawer onto the floor, sifting through three years of paper, throwing away everything that wasn't from January, and then tidying it all back up. Every. Single. Night. The cabinet is organized correctly. We just weren't using the labels.

With 90 days of accumulated history, this approach was scanning 90x more data than necessary on every run. The fix is to push the filter into the read itself so Spark can use the partition directory structure to skip everything irrelevant before a single file is opened.

Before (Expensive)
After (Optimized)

transformation-before.py
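from datetime import datetime, timedelta
from pyspark.sql.functions import col

yesterday = (datetime.now() - timedelta(days=1)).strftime("%Y-%m-%d")

# Reads the ENTIRE Delta table, then filters in memory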
# With 90 days of history: scans ~180GB to get ~2GB of useful data
silver_df = spark.read.format("delta").load(
    "abfss://data@mylake.dfs.core.windows.net/silver/sales/"
)
filtered_df = silver_df.filter(col("order_date") == yesterday)

transformation-after.py
from datetime import datetime, timedelta
from pyspark.sql.functions import col

yesterday = (datetime.now() - timedelta(days=1)).strftime("%Y-%m-%d")

# Filter pushed into the read — Spark only opens yesterday's partition
# With 90 days of history: scans ~2GB instead of ~180GB
silver_df = (
    spark.read.format("delta")
    .load("abfss://data@mylake.dfs.core.windows.net/silver/sales/")
    .filter(col("order_date") == yesterday)
)

Result: Data scanned per run dropped from ~180GB to ~2GB. Spark runtime fell from 18 minutes to 4 minutes. Monthly saving: ~$260.

note

For partition pruning to work, two conditions must both be true. The table must be partitioned by the filter column, and the filter must be applied directly to the loaded DataFrame, on the partition column itself, before any step that renames, derives, or otherwise rewrites that column. A filter on a derived or UDF-transformed version of the column can't be matched to the partition metadata and results in a full table scan.

Mistake #5: Keeping Historical Data on Hot Storage Tier

Monthly cost of this mistake: ~$180

The ADLS Gen2 Bronze layer had 14 months of raw Parquet files sitting on the Hot storage tier. No lifecycle policy had ever been configured.

ADLS Gen2 charges different rates depending on the storage access tier. When I pulled our actual invoice line items, the numbers told a clear story: our Bronze container was costing us $0.023 per GB per month on Hot, while data we hadn't touched in months was sitting right next to yesterday's files paying the same rate. Moving files older than 30 days to Cool dropped that rate to roughly $0.013/GB, about 44% less for data we only needed occasionally. Files older than 180 days dropped to Archive at around $0.002/GB, which is where old Bronze raw files belong when the Silver layer already has the clean version.

Fourteen months of ~2GB/day accumulates to roughly 850GB in the Bronze layer. The fix required exactly one policy configuration.

The Fix: ADLS Gen2 Lifecycle Management Policy

lifecycle-policy.json
{
  "rules": [
    {
      "name": "bronze-tier-management",
      "type": "Lifecycle",
      "definition": {
        "filters": {
          "prefixMatch": ["bronze/"],
          "blobTypes": ["blockBlob"]
        },
        "actions": {
          "baseBlob": {
            "tierToCool": {
              "daysAfterModificationGreaterThan": 30
            },
            "tierToArchive": {
              "daysAfterModificationGreaterThan": 180
            }
          }
        }
      }
    }
  ]
}

Bronze files automatically move to Cool after 30 days and Archive after 180 days — with no pipeline changes and no ongoing maintenance.
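
To get a feel for what the policy does to a growing container, the sketch below assigns each day's files to a tier by age and prices them with the per-GB rates quoted earlier. Treat it as a rough model: the function names are made up, 425 days stands in for "fourteen months", and it covers at-rest storage only, not transactions, retrieval, or egress.

```python
# Per-GB monthly rates as quoted in this section
RATES_PER_GB = {"Hot": 0.023, "Cool": 0.013, "Archive": 0.002}

def tier_for_age(days_since_modified):
    """Mirror the lifecycle rules: Cool after 30 days, Archive after 180."""
    if days_since_modified > 180:
        return "Archive"
    if days_since_modified > 30:
        return "Cool"
    return "Hot"

def monthly_storage_cost(daily_gb, days_of_history):
    """Bucket each day's data by tier and sum the at-rest monthly cost."""
    gb_by_tier = {"Hot": 0.0, "Cool": 0.0, "Archive": 0.0}
    for age in range(days_of_history):
        gb_by_tier[tier_for_age(age)] += daily_gb
    cost = sum(gb * RATES_PER_GB[tier] for tier, gb in gb_by_tier.items())
    return gb_by_tier, cost

gb_by_tier, tiered_cost = monthly_storage_cost(daily_gb=2, days_of_history=425)
all_hot_cost = sum(gb_by_tier.values()) * RATES_PER_GB["Hot"]
print(gb_by_tier, round(tiered_cost, 2), round(all_hot_cost, 2))
```

Most of the 850GB ends up in Archive, which is why the policy keeps paying off as history accumulates.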

tip

Apply lifecycle policies to the Silver and Gold layers too, with longer thresholds. Silver data accessed for backfills after 90 days can move to Cool. Gold data older than 365 days can move to Archive if your reporting doesn't require historical drill-downs that old.

Result: ~$160/month in combined storage and egress savings, purely passive, set once and forgotten.

Mistake #6: A Streaming Pipeline for 15-Minute Update Requirements

Monthly cost of this mistake: ~$380

A secondary pipeline fed a "near real-time" inventory dashboard. The product team had asked for updates as fast as possible, which was interpreted as: build a Kafka + Flink streaming pipeline with always-on infrastructure.

What the product team actually needed, when pinned down to a specific number: inventory counts updated every 15 minutes.

A streaming pipeline running 24/7 to deliver 15-minute updates is the cloud equivalent of leaving your car engine running all night because you have an early meeting. The always-on Kafka cluster and Flink job cost $380/month. The business requirement was achievable with a job that runs for 2–3 minutes, 96 times a day.

The Fix: Micro-Batch with ADF Tumbling Window Trigger

An ADF Tumbling Window trigger fires the pipeline every 15 minutes. Each run reads only the delta since the last watermark, processes it, and shuts down. No infrastructure stays running between executions.

tumbling-window-trigger.json
{
  "type": "TumblingWindowTrigger",
  "typeProperties": {
    "frequency": "Minute",
    "interval": 15,
    "startTime": "2024-01-01T00:00:00Z",
    "retryPolicy": {
      "count": 3,
      "intervalInSeconds": 30
    }
  }
}

The pipeline runs for 2–3 minutes every 15 minutes, processes the delta since the last run using the same watermark pattern from Mistake #2, then shuts down. The product team's dashboard still updates every 15 minutes. They noticed zero difference.
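
The difference between always-on and micro-batch is easiest to see as a duty cycle. A small sketch with hypothetical helper names, using the interval and run length from this section:

```python
MINUTES_PER_DAY = 24 * 60

def runs_per_day(interval_minutes):
    """How many times a trigger with this interval fires in a day."""
    return MINUTES_PER_DAY // interval_minutes

def active_fraction(interval_minutes, run_minutes):
    """Share of the day during which compute is actually running."""
    return runs_per_day(interval_minutes) * run_minutes / MINUTES_PER_DAY

print(runs_per_day(15))
print(active_fraction(15, 3))
```

At 96 runs of 3 minutes each, compute is active for about a fifth of the day, compared with all of it for the streaming setup.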

info

A useful mental model for this decision: streaming is the right choice when latency requirements are below 60 seconds. For anything above that threshold, a well-designed micro-batch pipeline is almost always cheaper, simpler, easier to monitor, and easier to debug. The hidden costs of streaming pipelines go beyond compute: they include more complex failure handling, harder-to-test logic, and longer debugging cycles.

Result: Streaming infrastructure cost of $380/month replaced by ~$40/month in ADF + Spark compute. $340/month saved.

Before and After Summary

info

Mistake	Monthly Cost Before	Monthly Cost After	Saving
Dedicated SQL Pool running 24/7	$1,800	$530	$1,270
Full load instead of incremental	$620	$35	$585
Over-provisioned Spark cluster	$580	$100	$480
No partition pruning	$290	$30	$260
Hot storage for historical data	$180	$20	$160
Streaming for 15-min updates	$380	$40	$340
Total	$3,850	$755	$3,095

From $4,247 down to approximately $1,150 after all fixes, a 73% cost reduction on a pipeline doing exactly the same work on exactly the same data.
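
The totals and the headline percentage can be re-derived from the table in a few lines. This is just the arithmetic, with the figures copied from the summary table and the original invoice. Note that the six itemized mistakes account for $3,850 of the $4,247.83 bill, and the reduction is measured against the full invoice.

```python
# (monthly cost before, monthly cost after) per mistake, from the summary table
costs = {
    "Dedicated SQL Pool running 24/7": (1800, 530),
    "Full load instead of incremental": (620, 35),
    "Over-provisioned Spark cluster": (580, 100),
    "No partition pruning": (290, 30),
    "Hot storage for historical data": (180, 20),
    "Streaming for 15-min updates": (380, 40),
}

total_before = sum(before for before, _ in costs.values())
total_after = sum(after for _, after in costs.values())
total_saving = total_before - total_after

invoice = 4247.83
remaining_bill = invoice - total_saving
reduction_pct = total_saving / invoice * 100

print(total_before, total_after, total_saving)
print(round(remaining_bill, 2), round(reduction_pct, 1))
```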

Cost Optimization Checklist

Run through this every quarter. Each item is a question. If the answer is "no" or "I don't know," investigate it.

Dedicated SQL Pool

Is auto-pause configured for outside business hours?
Is Dedicated SQL Pool actually required, or would Serverless SQL Pool suffice for the query pattern?

ADF Pipelines

Are any pipelines running full loads where incremental loads are possible?
Is a watermark implemented for every pipeline reading time-series data?

Spark Pools

Is node size right-sized for actual data volume, not default?
Is auto-termination set to 5 minutes, not 30–60?
Is spark.sql.shuffle.partitions tuned to actual data size?
Is autoscale enabled with realistic min/max node counts?

ADLS Gen2

Are lifecycle policies configured on all containers?
Are Delta tables partitioned by the column filtered most frequently?
Is partition pruning applied at read time in all Spark notebooks?

Streaming Infrastructure

What is the actual latency requirement, in minutes?
If it is above 5 minutes — is a micro-batch pipeline in use instead of always-on streaming?

Key Lessons

Defaults are expensive by design. Azure's defaults (60-minute Spark termination, no SQL Pool auto-pause, no lifecycle policies) are chosen for zero-friction setup, not cost efficiency. Every default should be reviewed and overridden deliberately on day one, not after the first billing surprise.

Incremental loading is not a future optimization. For any pipeline reading time-series data from a growing source, a full load that runs daily compounds in cost the same way interest compounds on debt. The watermark pattern takes a few hours to implement and pays for itself within a week.
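For readers who have not implemented it before, the watermark pattern can be sketched in a few lines of plain Python. This is a minimal illustration, not the pipeline from this post: the `incremental_load` function, the `updated_at` field name, and the sample rows are all assumptions. In a real pipeline the filter would be pushed down into the source query, and the watermark would be persisted in a control table between runs.

```python
from datetime import datetime

def incremental_load(source_rows, last_watermark):
    """Return the rows changed since last_watermark, plus the new watermark."""
    new_rows = [r for r in source_rows if r["updated_at"] > last_watermark]
    # If nothing changed, keep the old watermark so the next run starts from the same point
    new_watermark = max((r["updated_at"] for r in new_rows), default=last_watermark)
    return new_rows, new_watermark

# Hypothetical source table
source = [
    {"order_id": 1, "updated_at": datetime(2024, 1, 14, 9, 0)},
    {"order_id": 2, "updated_at": datetime(2024, 1, 15, 8, 30)},
    {"order_id": 3, "updated_at": datetime(2024, 1, 15, 17, 45)},
]

watermark = datetime(2024, 1, 15, 0, 0)    # stored at the end of the previous run
rows, watermark = incremental_load(source, watermark)

print([r["order_id"] for r in rows])       # [2, 3]
print(watermark)                           # 2024-01-15 17:45:00
```

Each run reads only what changed since the previous one, so the cost tracks the daily delta rather than the total size of the source.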

Partition pruning is free performance. Setting up partitioning correctly at table creation and pushing filters into the read step costs nothing to implement and can reduce Spark compute by over 90%. The only requirement is knowing which column you filter on most frequently, which you almost certainly already know.

"Real-time" almost never means real-time. The product team's requirement was 15-minute updates. The engineering interpretation was 24/7 streaming infrastructure. The gap between those two decisions cost $340/month and made the pipeline significantly harder to maintain. Before designing streaming, ask for a specific latency number, then design to that number.

Azure Cost Analysis is a weekly habit, not a monthly emergency. The six mistakes above were invisible until the invoice arrived. Fifteen minutes a week in Azure Cost Management catches problems while they are a $50 anomaly, not a $500 line item.

Frequently Asked Questions

Q: Should I always use Serverless SQL Pool instead of Dedicated SQL Pool to save costs? A: We actually tested this switch on our own setup before committing to the auto-pause approach. Serverless made sense for us until we crossed about 6 hours of daily query time. Below that threshold, serverless was cheaper every single month without exception, and we didn't need to manage any pause/resume scheduling at all. If your Power BI dashboard is hit heavily throughout the business day, dedicated will eventually win on pure cost. But if usage is bursty or confined to a few hours, don't provision dedicated capacity and then fight to keep it paused; just go serverless from the start.

Q: What if my source system doesn't have a reliable updated_at column for watermarking? A: We ran into this with one of our source databases: no timestamp, no audit column, nothing. We ended up going the CDC route using Debezium, which captures row-level changes at the database log level without touching the source schema at all. It took about a day to set up and was the cleanest solution we found. For append-only tables, an auto-incrementing primary key works as a watermark substitute. If neither option exists, row hash comparison is a last resort: it detects changes, but you're still reading the full source every run, which defeats most of the point.
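As a rough illustration of the row hash fallback mentioned in this answer, the sketch below hashes each row and compares it with the hash stored on the previous run. The function names, the `order_id` key, and the sample rows are hypothetical. Note that it still iterates over every source row, which is exactly the cost described above.

```python
import hashlib
import json

def row_hash(row):
    """Stable SHA-256 over a row's contents; key order is normalised first."""
    payload = json.dumps(row, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def changed_keys(source_rows, stored_hashes, key="order_id"):
    """Return keys whose row is new or whose hash differs from the stored one."""
    changed = []
    for row in source_rows:
        if stored_hashes.get(row[key]) != row_hash(row):
            changed.append(row[key])
    return changed

# Hashes saved at the end of the previous run
previous = [{"order_id": 1, "amount": 50}, {"order_id": 2, "amount": 75}]
stored = {r["order_id"]: row_hash(r) for r in previous}

current = [
    {"order_id": 1, "amount": 50},   # unchanged
    {"order_id": 2, "amount": 80},   # amount changed
    {"order_id": 3, "amount": 20},   # new row
]

print(changed_keys(current, stored))   # [2, 3]
```

This version detects inserts and updates only. Detecting deletes would need an extra pass over the stored keys that no longer appear in the source.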

Q: How do I choose the right partition column for a Delta Lake table? A: The short answer is: whichever column appears in your most common filter is your partition column, and for time-series data that's almost always a date. What I'd caution against is partitioning on something high-cardinality like user ID or transaction ID. We made that mistake early on and ended up with thousands of tiny files that made partition pruning useless and actually slowed down reads. A healthy partition should hold somewhere between 100MB and 1GB of data. If a partition is smaller than that, you're creating file overhead without the pruning benefit.
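The 100MB to 1GB guideline is easy to turn into a quick audit. The sketch below assumes you have already collected the total bytes per partition (for example from a storage listing). The partition names and sizes are made up for illustration, and the thresholds are simply the ones quoted in this answer.

```python
MB = 1024 * 1024
MIN_BYTES, MAX_BYTES = 100 * MB, 1024 * MB   # the 100MB-1GB range quoted above

def classify_partitions(partition_sizes):
    """Label each partition as too_small, ok, or too_large by total bytes."""
    report = {}
    for name, size in partition_sizes.items():
        if size < MIN_BYTES:
            report[name] = "too_small"
        elif size > MAX_BYTES:
            report[name] = "too_large"
        else:
            report[name] = "ok"
    return report

# Hypothetical sizes per partition directory
sizes = {
    "order_date=2024-01-15": 420 * MB,
    "order_date=2024-01-16": 12 * MB,
    "order_date=2024-01-17": 3000 * MB,
}

for name, verdict in classify_partitions(sizes).items():
    print(name, verdict)
```

If most partitions come back as too_small, the partition column is probably too fine-grained for the data volume, and a coarser one (month instead of day, say) is worth testing.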

Q: Will moving data to Archive tier in ADLS Gen2 break my backfill pipelines? A: I learned this the hard way: a backfill job failed at 2am because Archive data doesn't just open like a normal file. You have to explicitly rehydrate it first, and depending on the priority tier you choose, that can take anywhere from an hour to fifteen hours. We got caught by this once, and now the rule on our team is: only archive data where we have at least 24 hours of lead time if a backfill request comes in. For Bronze raw files older than 180 days, that's usually fine; nobody's doing emergency backfills on six-month-old raw data. For Silver and Gold, we stop at Cool tier and don't go to Archive at all, because the rehydration wait is unacceptable mid-incident.

Q: Is a 5-minute Spark auto-termination window too aggressive for interactive notebooks? A: For scheduled production jobs, 5 minutes is ideal: the job finishes, and within 5 minutes the cluster is gone and billing stops. For interactive development work, 5 minutes will drive you crazy because the cluster spins down between notebook cells if you pause to think. What we do is maintain two separate Spark pool configurations: one for production jobs set to 5-minute termination, and one for dev work set to 60 minutes. They run on different node sizes too. Keep them separate and you get cost efficiency in production without interrupting development flow.

Q: How do I detect whether partition pruning is actually working in my Spark job? A: The fastest way is to run df.explain() and look at the physical plan output. If partition pruning is active, you'll see a PartitionFilters entry in the scan node listing your filter predicate. If that field shows PartitionFilters: [] (empty brackets), you're scanning the full table regardless of what your code looks like. I've caught this bug three times by checking the plan after what looked like a correctly written filter, and each time it turned out the filter was being applied one transformation step too late, after Spark had already committed to a full read.
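If you check plans often, the eyeball test can be wrapped in a small helper. The sketch below works on plan text you have already captured from df.explain(). The helper name is hypothetical, and the two plan strings are abbreviated, illustrative examples rather than real output from this pipeline.

```python
def partition_pruning_active(plan_text):
    """True if the first PartitionFilters entry in the plan is not empty.

    Only the first scan node is checked; plans with several scans
    (joins, unions) would need each entry inspected separately.
    """
    marker = "PartitionFilters: ["
    start = plan_text.find(marker)
    if start == -1:
        return False   # no partition-aware scan in the plan at all
    # An empty filter list renders as "PartitionFilters: []"
    return not plan_text.startswith("]", start + len(marker))

# Abbreviated, illustrative plan fragments (assumed shape, not captured output)
pruned_plan = (
    "FileScan parquet [order_id#1,amount#2] "
    "PartitionFilters: [isnotnull(order_date#3), (order_date#3 = 2024-01-15)], "
    "PushedFilters: []"
)
full_scan_plan = (
    "FileScan parquet [order_id#1,amount#2] "
    "PartitionFilters: [], PushedFilters: [IsNotNull(amount)]"
)

print(partition_pruning_active(pruned_plan))      # True
print(partition_pruning_active(full_scan_plan))   # False
```

A check like this fits naturally into a notebook cell that runs before an expensive job, so that an accidental full scan is caught before the compute is spent.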

About the Author

Aditya Singh Rathore is a Data Engineer focused on building modern, scalable data platforms on Azure. He writes about data engineering, cloud architecture, and real-world pipelines on RecodeHive — turning hard-won production lessons into content anyone can apply.

🔗 LinkedIn | GitHub

📩 Got an Azure bill that surprised you? Drop the line item in the comments — happy to help debug it.

Medallion Architecture: How to Stop Your Data Pipeline from Becoming a Nightmare

rathoreadityasingh30@gmail.com (Aditya Singh Rathore) — Thu, 07 May 2026 00:00:00 GMT

It was a Tuesday afternoon when our analytics lead sent a message that made my stomach drop.

"The revenue numbers in the dashboard don't match what finance is reporting. We're off by $180,000. Can you check the pipeline?"

I spent the next four hours tracing data through a tangled mess of transformations, none of them documented, some running directly on raw API responses, others written six months ago by someone who had since left the team. By the time I found the issue (a deduplication step that had silently stopped working after a schema change upstream), the damage was done. Three teams had been working off wrong numbers for two weeks.

That incident is what introduced me to Medallion Architecture.

Not as a concept from a blog post. As a solution to a real, expensive, embarrassing problem that could have been caught immediately if we'd had any structure in how data moved through our pipeline.

So, What Is It?

Think of Medallion Architecture like a water filtration system.

Water from a river (your raw data) goes through multiple stages of filtering before it's safe to drink (your final reports). You wouldn't drink straight from the river — and you shouldn't build reports directly on raw, unvalidated data either.

The architecture divides your data journey into three layers:

Bronze → Silver → Gold

Each layer has one job. Each layer makes the data a little more trustworthy. By the time data reaches the end, it's reliable, consistent, and ready to power real business decisions.

🥉 Bronze: The "Keep Everything" Layer

Bronze is where data arrives, exactly as it came from the source. No cleaning, no filtering, no judgment.

APIs, databases, logs, CSV exports, it all lands here, untouched.

After the revenue incident, the first thing we did was create a Bronze layer in ADLS Gen2, a dedicated folder where every raw API response landed as-is, timestamped, and never overwritten.

Why not clean it immediately?

Because you will make mistakes in your pipeline. And when you do, you need to be able to go back to the original data and start over, without re-calling the API, without re-pulling from a source that may have already changed.

Bronze is your safety net. It's immutable, append-only, and complete.

Think of it as your data's long-term memory, messy, raw, but irreplaceable.

What Bronze looks like in practice

adls-gen2/
  └── bronze/
        └── sales/
              └── 2024/
                    ├── 01/raw_orders_20240115.parquet
                    ├── 02/raw_orders_20240201.parquet
                    └── 03/raw_orders_20240305.parquet

Files land here partitioned by date. Nothing is modified after landing. If the pipeline fails three steps later, you don't re-ingest, you reprocess from Bronze.

Key rules for Bronze

Append only: never overwrite or delete records
No transformation: store exactly what the source sent, including bad records
Schema as-received: don't enforce structure here, even if the source changes its format
Partition by ingestion date: makes reprocessing specific time ranges simple
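The path convention in the folder tree above is simple enough to generate in code, which keeps every ingestion job writing to the same layout. This is a small sketch: the `bronze_path` helper and its parameters are hypothetical, and the format it produces just mirrors the example paths shown earlier.

```python
from datetime import date

def bronze_path(source, dataset, ingestion_date, ext="parquet"):
    """Build a Bronze landing path partitioned by ingestion year and month."""
    return (
        f"bronze/{source}/{ingestion_date:%Y}/{ingestion_date:%m}/"
        f"raw_{dataset}_{ingestion_date:%Y%m%d}.{ext}"
    )

print(bronze_path("sales", "orders", date(2024, 1, 15)))
# bronze/sales/2024/01/raw_orders_20240115.parquet
```

Because the ingestion date is part of the file name, a second run on a different day never overwrites an earlier file, which is what keeps the layer append-only.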

🥈 Silver: Where the Real Work Happens

This is where data engineering gets interesting and where most of the actual work lives.

In the Silver layer, you take everything from Bronze and make it usable:

Deduplicate - remove duplicate records from retry logic or overlapping ingestion windows
Standardize - dates in ISO format, currencies in base units, strings trimmed and consistent
Validate - flag or quarantine records that fail business rules (negative prices, missing required fields)
Enforce schema - write Delta tables with defined column types and constraints
Enrich - join raw records with reference data (product names, region codes, customer tiers)

Most of the heavy lifting in a data pipeline lives here. It's not glamorous work but it's what separates trustworthy analytics from chaos.

Think of it as the editorial desk, messy raw material goes in, clean, consistent content comes out.

What Silver looks like in practice

Here's a simple PySpark transformation from Bronze to Silver:

Reference code

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date, lower, trim, when

spark = SparkSession.builder.appName("BronzeToSilver").getOrCreate()

# Read from Bronze
bronze_df = spark.read.format("parquet").load(
    "abfss://data@mylake.dfs.core.windows.net/bronze/sales/2024/"
)

# Clean and validate
silver_df = (
    bronze_df
    .dropDuplicates(["order_id"])                              # deduplicate
    .withColumn("order_date", to_date(col("order_date"), "yyyy-MM-dd"))
    .withColumn("region", lower(trim(col("region"))))          # standardize
    .withColumn("product", lower(trim(col("product"))))
    .withColumn(
        "is_valid",
        when(col("amount") > 0, True).otherwise(False)        # validate
    )
    .filter(col("order_id").isNotNull())                       # remove nulls
)

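# Write to Silver as a Delta table.
# overwriteSchema is deliberately left off, so an upstream schema change
# fails the write instead of silently redefining the Silver table.
(
    silver_df.write
    .format("delta")
    .mode("overwrite")
    .save("abfss://data@mylake.dfs.core.windows.net/silver/sales/")
)

# Note: count() triggers a second pass over the Bronze data
print(f"Silver layer written: {silver_df.count()} records")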

The deduplication step alone would have prevented our $180,000 revenue discrepancy. The raw Bronze data had duplicate order records from a retry bug in the API client. Silver catches them. Gold never sees them.

One big win beyond fixing bugs: multiple teams can now pull from the same Silver datasets instead of each building their own version of the truth. That alone eliminates an enormous amount of duplicate work and conflicting numbers.

What Silver looks like in storage

adls-gen2/
  └── silver/
        └── sales/
              ├── _delta_log/     ← Delta Lake transaction log
              ├── part-00000.parquet
              └── part-00001.parquet

Unlike Bronze (raw files), Silver is a proper Delta table with ACID guarantees, time travel, and schema enforcement.

🥇 Gold: Built for Business, Not Engineers

Gold is what your stakeholders actually see.

This layer takes clean Silver data and shapes it for specific use cases, sales dashboards, executive reports, product metrics. It's aggregated, optimized, and structured for fast queries.

You're not building for flexibility here. You're building for clarity.

Think of it as the finished product on the shelf, packaged, polished, and ready to use.

What Gold looks like in practice

# Import as F so Spark's sum() does not shadow Python's built-in sum()
from pyspark.sql import functions as F

# Read from Silver
silver_df = spark.read.format("delta").load(
    "abfss://data@mylake.dfs.core.windows.net/silver/sales/"
)

# Build Gold: monthly revenue by region
gold_df = (
    silver_df
    .filter(F.col("is_valid"))                                   # valid orders only
    .withColumn("order_month", F.trunc(F.col("order_date"), "month"))  # first day of month
    .groupBy("region", "order_month")
    .agg(
        F.count("order_id").alias("total_orders"),
        F.sum("amount").alias("total_revenue"),
        F.avg("amount").alias("avg_order_value"),
    )
    .orderBy("order_month", "region")
)

# Write to Gold
(
    gold_df.write
    .format("delta")
    .mode("overwrite")
    .save("abfss://data@mylake.dfs.core.windows.net/gold/revenue_by_region/")
)

The Gold table is what Power BI connects to. Pre-aggregated, fast, shaped exactly for the business question it answers.

What Gold looks like in storage

adls-gen2/
  └── gold/
        ├── revenue_by_region/      ← one table per business use case
        ├── customer_summary/
        └── product_performance/

Notice: Gold is not one big table. Each Gold table answers one specific business question.

Why This Actually Matters

Here's what Medallion Architecture would have changed about our Tuesday afternoon incident:

Problem we had	Without Medallion	With Medallion
Duplicate orders from API retry bug	Silently corrupted revenue reports	Caught and removed in Silver
Couldn't find where numbers went wrong	Four hours of undocumented rabbit holes	Isolated to exactly one layer
Re-ingesting data after the fix	Re-called the API (data had since changed)	Replayed from Bronze (data preserved)
Finance and analytics had different numbers	Both teams built their own transforms	Both teams use the same Silver table
Schema changed upstream, broke pipeline	Broke everything simultaneously	Bronze absorbed it, Silver flagged it

The pattern isn't just about organization, it's about trust. When your team knows exactly where data came from and how it was transformed at each step, confidence in analytics goes up. Decisions improve. Four-hour debugging sessions stop happening.

It's Not Always Perfect

Let's be honest: Medallion Architecture does add complexity.

More layers = more storage, more pipelines, more things to maintain. For a small team doing simple reporting, it might genuinely be overkill.

It's a great fit when:

You have multiple data sources with varying quality
Multiple teams consume the same data
Data quality is non-negotiable
Pipelines need to be recoverable and replayable
You need to audit exactly where a number came from

It's probably overkill when:

You have one small, clean dataset
It's a one-time analysis
You're just building a proof of concept

Beyond the Three Layers

In practice, teams often extend the model:

Landing / Staging layer — temporary storage before Bronze, used when data needs to be decrypted, unzipped, or format-converted before it can be stored
Feature layer — prepared datasets for ML model training, maintained by data science teams on top of Silver
Semantic layer — business-friendly models sitting between Gold and end users for self-serve BI

The three-tier model is a starting point, not a ceiling. The right number of layers is whatever your team actually needs.

The Full Folder Structure

Here's what a complete Medallion Architecture implementation looks like in ADLS Gen2:

adls-gen2/
  └── data/
        ├── bronze/
        │     ├── sales/2024/01/raw_orders_20240115.parquet
        │     └── customers/2024/01/raw_customers_20240115.json
        │
        ├── silver/
        │     ├── sales/
        │     │     ├── _delta_log/
        │     │     └── part-00000.parquet
        │     └── customers/
        │           ├── _delta_log/
        │           └── part-00000.parquet
        │
        └── gold/
              ├── revenue_by_region/
              ├── customer_summary/
              └── product_performance/

This is the exact structure we adopted after the revenue incident. Bronze preserved everything. Silver caught the duplicates. Gold gave the business team numbers they could trust.

The Key Lessons

1. Raw data and report data should never live in the same layer. The moment raw data flows directly into a dashboard, you've lost the ability to catch errors before they reach stakeholders.

2. Bronze is not a dumping ground, it's a source of truth. Its value is that it's complete and immutable. The messiness is the point.

3. Most data engineering work happens in Silver. Deduplication, validation, standardization: this is where pipeline quality is actually built.

4. Gold tables are specific, not flexible. One table per business use case. Pre-aggregated, fast, and shaped exactly for the question it answers.

5. When something breaks, you replay from Bronze. You never re-ingest from source. Bronze is your checkpoint.

About the Author

I'm Aditya Singh Rathore, a Data Engineer passionate about building modern, scalable data platforms. I write about data engineering, Azure, and real-world pipeline design on RecodeHive — turning hard-won lessons into content anyone can learn from.

🔗 LinkedIn | GitHub

📩 Had a similar pipeline disaster? Drop it in the comments. I'd love to hear how you solved it.

Azure Data Factory Pipeline: Build Your First ETL in 10 Minutes

rathoreadityasingh30@gmail.com (Aditya Singh Rathore) — Wed, 06 May 2026 00:00:00 GMT

The first time someone asked me to "build an ETL pipeline," I nodded confidently and then quietly searched "what is ETL" on my second monitor.

Extract. Transform. Load.

Three words that describe something every data team does dozens of times a day — pulling data from somewhere, doing something to it, and putting it somewhere more useful. Simple idea. Historically, painful to implement.

You'd write Python scripts that broke when the source schema changed. You'd schedule them with cron jobs that nobody monitored. You'd debug failures at 2am by reading raw logs.

Azure Data Factory (ADF) exists to replace all of that with a visual, managed, scalable pipeline service, one where you can build a working ETL in minutes, not days, and monitor it from a dashboard instead of a terminal.

This guide walks you through everything, the concepts, the components, and a complete step-by-step pipeline you can build right now.

What Is Azure Data Factory?

Azure Data Factory is Microsoft's cloud-native ETL and data integration service. It lets you build data pipelines, workflows that move data from one place to another, transform it along the way, and load it into a destination where it's actually useful.

The key word is visual. ADF gives you a drag-and-drop canvas where you connect activities, configure sources and destinations, and build complex workflows without writing infrastructure code.

Under the hood, it handles:

Connecting to 90+ data sources (databases, APIs, files, SaaS apps)
Moving data at scale using managed compute
Scheduling and triggering pipeline runs
Monitoring, alerting, and retry logic

Think of it as the orchestration layer of your Azure data stack, the thing that decides what data moves where, when, and how.

The 4 Concepts You Need to Know First

Before you touch the UI, these four concepts need to click. Everything in ADF is built on them.

1. Linked Service: The Connection

A Linked Service is a connection string. It tells ADF how to connect to an external resource — a storage account, a database, an API.

Think of it as the key to a door. Before ADF can read from your Blob Storage or write to your SQL database, it needs a Linked Service that holds the credentials and connection details for that resource.

You create a Linked Service once, then reuse it across as many datasets and pipelines as you need.

Examples:

AzureStorageLinkedService → connects to your ADLS Gen2 account
AzureSqlLinkedService → connects to your Azure SQL Database
RestApiLinkedService → connects to an external HTTP API

2. Dataset: The Pointer

A Dataset points to the specific data within a Linked Service.

If the Linked Service is the key to the building, the Dataset is the directions to a specific room inside it. It tells ADF: "The data I care about is in this container, in this folder, in this file format."

Examples:

A Dataset pointing to bronze/sales/2024/jan/*.csv in your ADLS Gen2 account
A Dataset pointing to the [dbo].[orders] table in your SQL database
A Dataset describing a Parquet file with a known schema

3. Activity: The Work

An Activity is a single step of work inside a pipeline. ADF has three categories:

Data Movement — Copy data from source to destination. The Copy Activity is the most common one you'll use.
Data Transformation — Transform data using Mapping Data Flows, Databricks notebooks, or stored procedures.
Control Flow — Logic and orchestration: If/Else conditions, ForEach loops, Wait activities, Execute Pipeline (call another pipeline).

4. Pipeline: The Workflow

A Pipeline is a logical grouping of activities that together perform a unit of work.

Your pipeline might have three activities: a Copy Activity to land raw data, a Data Flow activity to clean it, and a Stored Procedure activity to update a watermark table. Together they form one repeatable workflow.

The ETL Flow in ADF: Visualised

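Here's how all four concepts connect in a real pipeline: a Linked Service holds the connection to a resource, a Dataset points to specific data inside that resource, an Activity reads from one Dataset and writes to another, and a Pipeline groups the Activities into one repeatable workflow.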

Build Your First Pipeline: Step by Step

Let's build a real pipeline: copy a CSV file from Azure Blob Storage into ADLS Gen2, landing it in a bronze/ folder.

What you need before starting:

An Azure account (free trial works fine)
A Storage Account with hierarchical namespace enabled (ADLS Gen2)
A CSV file uploaded to a container called source/

Step 1: Create an Azure Data Factory

Go to the Azure Portal
Search for Data Factory → click Create
Fill in the details:
- Resource Group: your existing one or create new
- Name: sales-data-factory (must be globally unique)
- Region: same as your storage account
Click Review + Create → Create
Once deployed, click Launch Studio

You're now in ADF Studio, the visual authoring environment.

Step 2: Create a Linked Service for Your Storage Account

In ADF Studio, click Manage (toolbox icon, left sidebar)
Click Linked Services → New
Search for Azure Data Lake Storage Gen2 → Select → Continue
Fill in:
- Name: ADLSGen2LinkedService
- Authentication: Account Key (simplest for now)
- Storage Account: select yours from the dropdown
Click Test Connection — you should see ✅ Connection successful
Click Create!

Step 3: Create the Source Dataset

This dataset points to the CSV file in your source/ container.

Click Author (pencil icon, left sidebar)
Click + → Dataset
Search for Azure Data Lake Storage Gen2 → Continue
Select Delimited Text (CSV format) → Continue
Fill in:
- Name: SourceCSVDataset
- Linked Service: ADLSGen2LinkedService
- File path: source/ → browse and select your CSV file
- First row as header: ✅ checked
Click OK

Step 4: Create the Sink Dataset

This dataset points to where the file should land, your bronze/ folder.

Click + → Dataset again
Same steps — Azure Data Lake Storage Gen2 → Delimited Text
Fill in:
- Name: BronzeCSVDataset
- Linked Service: ADLSGen2LinkedService
- File path: bronze/sales/ (type this manually, it doesn't need to exist yet, ADF will create it)
Click OK

Step 5: Build the Pipeline

Click + → Pipeline → name it CopySalesToBronze
From the Activities panel on the left, expand Move & Transform
Drag Copy data onto the canvas
Click the Copy Activity to open its settings:

Source tab:

Source dataset: SourceCSVDataset

Sink tab:

Sink dataset: BronzeCSVDataset
Copy behavior: PreserveHierarchy

Mapping tab:

Click Import schemas - ADF reads your CSV headers and maps columns automatically

Click Validate (toolbar) - you should see no errors
Click Debug - this runs the pipeline immediately without publishing
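Behind the canvas, ADF stores each of these objects as JSON, and seeing the references written out makes it clearer how the four concepts link together. The sketch below mirrors that shape as Python dictionaries. Treat it as an approximation under assumptions: the field names and type values follow ADF's JSON format as I understand it and should be checked against the JSON view in ADF Studio, the activity name is made up, and the split of each file path into container and folder is my guess at how the paths above map.

```python
import json

# Linked Service: the connection (ADLS Gen2 connector type assumed to be "AzureBlobFS")
linked_service = {
    "name": "ADLSGen2LinkedService",
    "properties": {
        "type": "AzureBlobFS",
        "typeProperties": {"url": "https://<account>.dfs.core.windows.net"},
    },
}

def csv_dataset(name, container, folder):
    """Dataset: a pointer to CSV data reachable through the Linked Service."""
    return {
        "name": name,
        "properties": {
            "type": "DelimitedText",
            "linkedServiceName": {
                "referenceName": linked_service["name"],
                "type": "LinkedServiceReference",
            },
            "typeProperties": {
                "location": {
                    "type": "AzureBlobFSLocation",
                    "fileSystem": container,   # assumed: first path segment is the container
                    "folderPath": folder,
                },
                "firstRowAsHeader": True,
            },
        },
    }

source_ds = csv_dataset("SourceCSVDataset", "source", "")
sink_ds = csv_dataset("BronzeCSVDataset", "bronze", "sales")

# Pipeline: one Copy activity that reads the source dataset and writes the sink dataset
pipeline = {
    "name": "CopySalesToBronze",
    "properties": {
        "activities": [
            {
                "name": "CopyCsvToBronze",   # hypothetical activity name
                "type": "Copy",
                "inputs": [{"referenceName": source_ds["name"], "type": "DatasetReference"}],
                "outputs": [{"referenceName": sink_ds["name"], "type": "DatasetReference"}],
                "typeProperties": {
                    "source": {"type": "DelimitedTextSource"},
                    "sink": {"type": "DelimitedTextSink"},
                },
            }
        ]
    },
}

print(json.dumps(pipeline, indent=2))
```

The point to notice is that nothing is embedded: the pipeline refers to datasets by name, and the datasets refer to the Linked Service by name. That is why one Linked Service can be reused across many datasets and pipelines.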

Step 6: Publish and Add a Trigger

Once Debug runs successfully:

Click Publish All (top toolbar) - this saves everything to ADF
Click Add trigger → New/Edit
Click New → configure:
- Type: Schedule
- Start: today's date
- Recurrence: Every 1 Day at 02:00 AM
Click OK → OK
Click Publish All again

Your pipeline now runs automatically every night at 2am, copying new sales data into your bronze layer.

Step 7: Monitor Your Pipeline

Click Monitor (chart icon, left sidebar)
You'll see all pipeline runs - status, duration, rows copied
Click any run to see activity-level details
If something fails, click the error icon to see exactly which activity failed and why

What Just Happened: The Full Picture

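Let's step back and look at what you built: one Linked Service to your storage account, a source Dataset and a sink Dataset, a pipeline with a single Copy Activity between them, and a schedule trigger that runs it every night.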

This is the Extract and Load part of ETL. The file is extracted from the source container and loaded into the bronze layer, untouched, exactly as it arrived.

What Comes Next: Transform

The pipeline you built moves data. To transform it, you add one of two things after the Copy Activity:

Option 1: Mapping Data Flow (no-code). A visual transformation canvas inside ADF. Drag and drop Filter, Join, Aggregate, and Derived Column transformations. Runs on Spark under the hood. Great for teams that don't want to write code.

Option 2: Databricks Notebook Activity. Call an existing Databricks notebook from your ADF pipeline. The notebook runs your Python/Spark transformation logic and writes cleaned data to the silver layer. Best for complex transformations that need code.

The full Medallion Architecture flow in ADF looks like this:

Source API / Database
        ↓
Copy Activity → bronze/ (raw data, as-is)
        ↓
Mapping Data Flow / Databricks Notebook → silver/ (cleaned, validated)
        ↓
Mapping Data Flow / Databricks Notebook → gold/ (aggregated, business-ready)
        ↓
Power BI DirectLake → Dashboard

Triggers: When Does Your Pipeline Run?

ADF gives you three trigger types, plus manual (on-demand) runs:

Trigger Type	When it fires	Use case
Schedule	At a fixed time/frequency	Nightly batch loads
Tumbling Window	Fixed intervals with state	Hourly incremental loads
Storage Event	When a file arrives in storage	File-arrival driven pipelines
Manual	On demand	One-time loads, testing

For production pipelines, Storage Event triggers are the most powerful, your pipeline fires automatically the moment a new file lands in your container, with no polling or scheduling lag.
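The Tumbling Window row in the table above is the one that tends to confuse people, so here is what "fixed intervals with state" means in practice. The sketch below is plain Python, not ADF code: the `tumbling_windows` function is hypothetical and only shows how a time range is cut into back-to-back, non-overlapping windows, each of which a run would process exactly once.

```python
from datetime import datetime, timedelta

def tumbling_windows(start, end, interval):
    """Yield back-to-back, non-overlapping (window_start, window_end) pairs."""
    window_start = start
    while window_start + interval <= end:
        yield window_start, window_start + interval
        window_start += interval

windows = list(
    tumbling_windows(
        datetime(2024, 1, 15, 0, 0),
        datetime(2024, 1, 15, 3, 0),
        timedelta(hours=1),
    )
)

for window_start, window_end in windows:
    # An hourly incremental load would read rows where
    # window_start <= updated_at < window_end
    print(window_start.time(), "->", window_end.time())
```

Because each window has an explicit start and end, a failed hour can be rerun on its own without touching its neighbours, which is what makes this trigger type a good match for hourly incremental loads.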

Common Mistakes Beginners Make

1. Using the same Linked Service for every environment. Create separate Linked Services for dev, staging, and production. Use ADF's parameterisation to swap them out without changing pipeline logic.

2. Not testing with Debug before publishing. Always Debug first. Publishing without testing means failures hit production. Debug runs don't count against your trigger history.

3. Hardcoding file paths in datasets. Parameterise your datasets so the same pipeline can process different files dynamically. One pipeline, many files, not one pipeline per file.

4. No monitoring alerts. Set up Azure Monitor alerts for pipeline failures. You shouldn't find out a pipeline failed when someone asks why last night's data is missing.

Key Takeaways

1. ADF is built on four concepts. Linked Services (connections), Datasets (pointers), Activities (work), Pipelines (workflows). Everything else is a variation of these four.

2. The Copy Activity is your workhorse. It works across ADF's 90+ connectors and handles schema mapping, file format conversion, and retry logic out of the box.

3. ADF is the orchestration layer, not the transformation layer. For heavy transformations, ADF calls Databricks or Data Flows, it doesn't do the transformation itself.

4. Triggers make pipelines production-ready. A pipeline without a trigger is just a script you run manually. Add a trigger and it becomes infrastructure.

5. ADF fits naturally into Medallion Architecture. Copy Activity lands data in bronze. Data Flows or Databricks jobs process silver and gold. ADF orchestrates the whole sequence.

About the Author

I'm Aditya Singh Rathore, a Data Engineer passionate about building modern, scalable data platforms on Azure. I write about data engineering, cloud architecture, and real-world pipelines on RecodeHive, breaking down complex concepts into things you can actually use.

🔗 LinkedIn | GitHub

📩 Stuck on a specific ADF activity or pipeline pattern? Drop your question in the comments.

Azure Storage & ADLS Gen2: Where Does Your Data Actually Live?

rathoreadityasingh30@gmail.com (Aditya Singh Rathore) — Wed, 06 May 2026 00:00:00 GMT

My first week working with Azure, I broke a pipeline before it even started.

I had a simple job: land some raw CSV files from a sales API into Azure so a Spark job could pick them up later. I searched "Azure storage", saw five different options staring back at me, panicked slightly, and clicked the first one that sounded sensible - Azure Table Storage.

Three hours later, I was staring at an error I didn't understand, in a service that was never designed for files.

Table Storage is a NoSQL key-value store. It stores entities and properties, not CSV files. My data had nowhere to go.

That confusion is more common than most Azure tutorials admit. And it happens because nobody explains the one question that actually matters before anything else:

Where does your data actually live in Azure and why?

This blog answers that. We'll walk through all four Azure storage types, show exactly where each one fits in a real data pipeline, and then go deep on the one that changes everything for data engineering: Azure Data Lake Storage Gen2.

Azure Has Four Storage Types. Here's the Map.

Before we build anything, let's get oriented.

Azure bundles all storage services under a single Storage Account, one entry point, one namespace, one billing account. Inside that account, you get access to four distinct storage services, each built for a different job.

Here's the quick map before we go deeper:

Storage Type	Think of it as	Stores	Used in pipelines for
Blob Storage	A file cabinet	Any file: CSV, JSON, Parquet, images, logs	Raw data landing zone
Queue Storage	A mailbox	Messages between services	Triggering pipeline steps
Table Storage	A ledger	Structured key-value rows	Tracking run state, metadata
File Storage	A shared network drive	Files accessed over SMB	Legacy app file shares

None of these is "better." They serve different stages of the same pipeline. The mistake most beginners make, including me, is picking one at random instead of understanding the job each one does.

Let's walk through them in the order they matter for a real data engineering workflow.

Blob Storage: The Foundation of Everything

When data arrives in Azure, it almost always lands in Blob Storage first.

Blob stands for Binary Large Object, which is just a fancy way of saying "any file." CSV, JSON, Parquet, images, videos, audio, ZIP archives, raw log dumps: Blob Storage holds all of it without caring about structure or format.

There's no schema enforcement, no type checking. You put a file in, you get it back out. At any scale.

The three blob types

Depending on how your data is written, you'll use one of three blob types:

Block Blob: Upload a file all at once. This covers 95% of data engineering use cases; your CSVs, Parquet files, and JSON exports all go here.
Append Blob: Add data continuously without modifying what's already there. Perfect for log files that grow over time.
Page Blob: Optimised for random read/write operations. Used mainly for VM disks. You'll rarely touch this directly.

Access tiers: storage that adjusts to how often you actually need the data

One of Blob Storage's most underrated features is access tiering:

Hot: Data you access daily. Highest storage cost, lowest read cost.
Cool: Data you access occasionally. Cheaper to store, slightly more to read. 30-day minimum storage period.
Archive: Data you almost never access. Extremely cheap to store, but it is offline and can take hours to rehydrate before you can read it. 180-day minimum storage period. Think old compliance records.

You can set lifecycle policies to move data automatically between tiers as it ages. Last month's raw files move from hot to cool. Last year's move to archive. You save money without touching anything manually.
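
As a rough sketch, the ageing rule described above could look something like the lifecycle policy below, built here as a Python dictionary and printed as JSON. The rule name, the `landing/raw/` prefix (a hypothetical container called `landing`), and the 30- and 365-day thresholds are all assumptions for illustration; check the field names against the current lifecycle management documentation before applying anything to a real account.

```python
import json

# Hypothetical rule: the name, prefix, and day thresholds are illustrative only.
lifecycle_policy = {
    "rules": [
        {
            "enabled": True,
            "name": "age-out-raw-files",
            "type": "Lifecycle",
            "definition": {
                "filters": {
                    "blobTypes": ["blockBlob"],
                    # Format is "<container>/<path prefix>"; both parts are assumptions.
                    "prefixMatch": ["landing/raw/"],
                },
                "actions": {
                    "baseBlob": {
                        # Move to Cool once a file has not been modified for 30 days.
                        "tierToCool": {"daysAfterModificationGreaterThan": 30},
                        # Move to Archive once it has not been modified for a year.
                        "tierToArchive": {"daysAfterModificationGreaterThan": 365},
                    }
                },
            },
        }
    ]
}

policy_json = json.dumps(lifecycle_policy, indent=2)
print(policy_json)
```

The resulting JSON is the kind of document you would paste into the lifecycle management blade or pass to your deployment tooling.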

Where Blob Storage fits in a pipeline

In Medallion Architecture, Blob Storage is the natural home for the Bronze layer, the raw, unprocessed data exactly as it arrived from source systems. Nothing is cleaned. Nothing is validated. It just lands and waits.

But here's where things get interesting.

Plain Blob Storage works perfectly for general file storage. But for big data analytics pipelines, the kind where you're processing millions of files, running Spark jobs, and building Bronze/Silver/Gold layers, it has a critical limitation that most tutorials don't mention until you've already hit it.

The Problem with Plain Blob Storage at Scale

Here's something I found out the hard way six months into working with Azure pipelines.

I had a container full of raw sales data — about 40,000 Parquet files organised under a path that looked like raw/2024/. My team decided to rename it to bronze/2024/ to match our Medallion Architecture convention. Simple enough, right?

It took 47 minutes.

Not because Azure was slow. Because what looked like a folder called raw/ was never actually a folder. In plain Blob Storage, everything lives at the same flat level, the slashes in a path like raw/2024/jan/file.parquet are just characters in a key name, the same way a filename on your desktop could technically be called raw-2024-jan-file.parquet with dashes instead.

There is no directory underneath. So renaming means Azure copies each file to the new key name and deletes the old one, one file at a time, 40,000 times in a row.

At big data scale, where you're managing millions of files across Bronze, Silver, and Gold layers, that's not a minor inconvenience. It's a pipeline blocker.

This is the exact problem ADLS Gen2 was built to fix.

ADLS Gen2: Blob Storage, Evolved

Azure Data Lake Storage Gen2 (ADLS Gen2) is not a separate service. It's Blob Storage with one critical feature enabled: the Hierarchical Namespace.

With hierarchical namespace turned on, folders become real. A directory with ten million files inside it can be renamed or deleted in a single atomic operation, instant, regardless of how many files it contains.
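
To make the difference concrete, here is a toy model in plain Python. It is not how Azure is implemented; a flat dictionary stands in for a flat namespace, a nested dictionary stands in for a hierarchical one, and the function names are made up for this sketch. The point is only to count how much work a "rename" takes in each model.

```python
def rename_prefix_flat(store, old_prefix, new_prefix):
    """Flat namespace: every key under the prefix is copied, then deleted."""
    operations = 0
    matching_keys = [key for key in store if key.startswith(old_prefix)]
    for key in matching_keys:
        new_key = new_prefix + key[len(old_prefix):]
        store[new_key] = store[key]  # copy to the new key name
        del store[key]               # delete the old key
        operations += 2
    return operations


def rename_directory_tree(root, old_name, new_name):
    """Hierarchical namespace: one directory entry is re-linked."""
    root[new_name] = root.pop(old_name)
    return 1


# 40,000 files, as in the story above.
flat = {f"raw/2024/file_{i}.parquet": b"" for i in range(40_000)}
flat_ops = rename_prefix_flat(flat, "raw/", "bronze/")

tree = {"raw": {"2024": {f"file_{i}.parquet": b"" for i in range(40_000)}}}
tree_ops = rename_directory_tree(tree, "raw", "bronze")

print(flat_ops, tree_ops)  # 80000 1
```

The flat model's work grows with the number of files; the tree model does one operation no matter how many files sit underneath.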

That one change makes ADLS Gen2 fast enough for serious analytics workloads. It's the storage layer that Databricks, Synapse, Azure Data Factory, and Microsoft Fabric are all built to work with.

The full ADLS Gen2 structure

ADLS Gen2 organises data in a real hierarchy, with three levels beneath the storage account:

Storage Account
    └── Container (called a File System in ADLS)
            └── Directories (real, nested folders)
                    └── Files (your actual data)

In practice, for a Medallion Architecture pipeline:

mydatalake/
    └── data/
            ├── bronze/
            │     └── sales/
            │           └── 2024/jan/raw_orders.parquet
            ├── silver/
            │     └── sales/
            │           └── 2024/jan/cleaned_orders.parquet
            └── gold/
                  └── sales/
                        └── 2024/jan/monthly_revenue.parquet

Bronze, Silver, and Gold are real directories. Spark jobs move data between them. ADF pipelines write to them. Power BI reads from them. The Medallion pattern isn't an abstract concept; it's a folder structure in ADLS Gen2 with transformation logic connecting the layers.

The ABFS driver: why this matters for Spark

When Spark, Databricks, Synapse, or Fabric connect to ADLS Gen2, they use the Azure Blob File System (ABFS) driver, accessed via the abfss:// protocol.

This driver was purpose-built for analytics workloads. It's significantly faster than the old WASB driver for directory-heavy operations, and it's the reason tools like Databricks can list, read, and write millions of files in ADLS Gen2 efficiently.

Every time you see abfss://container@storageaccount.dfs.core.windows.net/ in a notebook or pipeline config, that's ADLS Gen2 being accessed via the ABFS driver.
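
If you build these paths in code, a tiny helper keeps the pieces straight. This is a minimal sketch; the function name and the `data` / `mydatalake` values are hypothetical placeholders, not anything from a real environment.

```python
def abfss_uri(container: str, account: str, path: str = "") -> str:
    """Build an abfss:// URI in the container@account form shown above."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


# Hypothetical container and storage account names.
bronze_sales = abfss_uri("data", "mydatalake", "bronze/sales/2024/")
print(bronze_sales)
# abfss://data@mydatalake.dfs.core.windows.net/bronze/sales/2024/
```

In a Spark notebook, a string like this is what you would hand to a reader such as spark.read.parquet(...), assuming the session is already authorised against the storage account.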

Fine-grained access control with POSIX ACLs

Regular Blob Storage gives you Role-Based Access Control (RBAC), scoped no finer than the container. ADLS Gen2 goes further with POSIX-style Access Control Lists (ACLs), the same permission model used in Linux file systems.

This means you can grant a data science team read access to only the silver/ directory, without exposing bronze/ (raw, potentially sensitive data) or gold/ (business metrics). Fine-grained, at the folder and file level.

For regulated industries (finance, healthcare, government), this isn't a nice-to-have. It's a requirement.
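
For a feel of what those ACLs look like, here is a small sketch that assembles an ACL string in the short `type:id:permissions` form that ADLS Gen2 uses. The helper name and the all-zeros group object ID are placeholders I made up; in practice the ID would be the object ID of a security group from your directory. Keep in mind that reading files inside silver/ also needs execute (x) permission on each parent directory, so the team can traverse down to it.

```python
def acl_spec(owner="rwx", group="r-x", other="---", named_groups=None):
    """Join ACL entries into one comma-separated string."""
    entries = [f"user::{owner}", f"group::{group}", f"other::{other}"]
    for object_id, permissions in (named_groups or {}).items():
        entries.append(f"group:{object_id}:{permissions}")
    return ",".join(entries)


# Placeholder object ID standing in for the data science team's security group.
DATA_SCIENCE_GROUP = "00000000-0000-0000-0000-000000000001"

silver_acl = acl_spec(named_groups={DATA_SCIENCE_GROUP: "r-x"})
print(silver_acl)
# user::rwx,group::r-x,other::---,group:00000000-0000-0000-0000-000000000001:r-x
```

Applied to silver/ only, an entry like this gives the team read and list access there while bronze/ and gold/ stay closed to them.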

Storage tiers work at directory level

Just like Blob Storage, ADLS Gen2 supports Hot, Cool, and Archive tiers. Lifecycle rules filter on path prefixes, so you can scope them to a directory, automatically archiving bronze/2023/ partitions when they're more than a year old while keeping bronze/2024/ hot for active pipeline use.

ADLS Gen2 is what OneLake is built on

If you've read about Microsoft Fabric, you know that OneLake is Fabric's unified data lake, the single storage layer that every Fabric workload reads from and writes to.

OneLake is fundamentally built on ADLS Gen2, with a single unified namespace across your entire Fabric tenant. Understanding ADLS Gen2 means you understand the storage engine that powers Fabric, Synapse, Databricks on Azure, and every serious Azure data platform.

Azure Service	How it uses ADLS Gen2
Azure Data Factory	Reads source files, writes pipeline outputs
Azure Databricks	Reads/writes Delta tables via ABFS driver
Azure Synapse Analytics	Queries files directly with SQL serverless
Microsoft Fabric / OneLake	OneLake IS ADLS Gen2 unified namespace
Azure Machine Learning	Stores training datasets and model artifacts
Power BI	DirectLake mode reads Delta files from ADLS Gen2

The Supporting Cast: Queue and Table Storage

ADLS Gen2 stores your data. But a pipeline isn't just storage; it's also coordination, state management, and event triggering. That's where Queue Storage and Table Storage come in.

They're not glamorous. But remove them from a production pipeline and things fall apart quickly.

Queue Storage: The Pipeline Trigger

Queue Storage stores messages, small packets of information passed between services asynchronously.

In a data pipeline context, Queue Storage is typically used as a trigger mechanism. When a new file lands in ADLS Gen2, the storage account can raise a blob-created event (routed through Azure Event Grid) that drops a message into a queue. Azure Data Factory (or an Azure Function) listens to that queue and kicks off the pipeline automatically.

New file lands in ADLS Gen2 bronze/
    → Event triggers a Queue message: "new file: sales_2024_jan.parquet"
    → ADF pipeline picks up the message
    → Pipeline runs transformation
    → Cleaned data written to silver/

Without Queue Storage, you'd either poll for new files on a schedule (wasteful) or trigger pipelines manually (not scalable).
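
Here is a minimal sketch of what the consumer side of that flow might do with one message. It assumes the message body is an already-decoded JSON blob-created event with `eventType` and `subject` fields; the sample payload, the function name, and the bronze-to-silver path mapping are illustrative, and a real message may arrive base64-encoded and carry more fields.

```python
import json


def blob_path_from_event(message_text: str):
    """Return (container, blob_path) for a blob-created event, else None."""
    event = json.loads(message_text)
    if event.get("eventType") != "Microsoft.Storage.BlobCreated":
        return None
    # subject looks like /blobServices/default/containers/<container>/blobs/<path>
    _, _, rest = event["subject"].partition("/containers/")
    container, _, blob_path = rest.partition("/blobs/")
    return container, blob_path


# Hand-written sample message for illustration.
sample_message = json.dumps({
    "eventType": "Microsoft.Storage.BlobCreated",
    "subject": "/blobServices/default/containers/data/blobs/bronze/sales_2024_jan.parquet",
})

container, bronze_path = blob_path_from_event(sample_message)
silver_path = bronze_path.replace("bronze/", "silver/", 1)
print(container, bronze_path, silver_path)
# data bronze/sales_2024_jan.parquet silver/sales_2024_jan.parquet
```

Anything that is not a blob-created event returns None, so the consumer can skip it instead of failing the run.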

Key facts:

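Messages up to 64 KB in size
A queue can hold millions of messages, up to the capacity limit of the storage account
Messages expire after 7 days by default if unconsumed (the time-to-live is configurable)
Built-in retry: if a consumer fails to delete a message before its visibility timeout ends, the message reappears for another attempt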

Table Storage: The Pipeline Memory

Table Storage is Azure's NoSQL key-value store, schemaless rows of properties, queried by partition and row key.

In data pipelines, Table Storage earns its place as the watermark store, the place that remembers where a pipeline left off.

Imagine your ADF pipeline runs every night and ingests new rows from a source database. It can't re-read everything from day one every night. Instead, it records the last_run_timestamp in a Table Storage entity:

PartitionKey:        "sales_pipeline"
RowKey:              "last_run"
last_run_timestamp:  "2024-01-15T02:00:00Z"

Next run, the pipeline reads this value, queries only rows updated since then, and updates the watermark when done. This is called incremental ingestion, and Table Storage is the simplest, cheapest place to track it.
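
The watermark logic itself is only a few lines. The sketch below uses an in-memory dictionary as a stand-in for the Table Storage table and a hard-coded list as the source database, so every name and row in it is made up for illustration. I also advance the watermark to the newest row actually ingested rather than to the wall-clock time, which is one common way to avoid skipping rows that change mid-run.

```python
from datetime import datetime

# In-memory stand-in for a Table Storage table, keyed by (PartitionKey, RowKey).
watermark_table = {
    ("sales_pipeline", "last_run"): {"last_run_timestamp": "2024-01-15T02:00:00+00:00"}
}

# Made-up source rows.
source_rows = [
    {"order_id": 1, "updated_at": "2024-01-14T23:10:00+00:00"},
    {"order_id": 2, "updated_at": "2024-01-15T09:30:00+00:00"},
    {"order_id": 3, "updated_at": "2024-01-15T18:45:00+00:00"},
]


def run_incremental(table, rows):
    """Return rows newer than the stored watermark, then advance the watermark."""
    key = ("sales_pipeline", "last_run")
    watermark = datetime.fromisoformat(table[key]["last_run_timestamp"])
    new_rows = [r for r in rows if datetime.fromisoformat(r["updated_at"]) > watermark]
    if new_rows:
        newest = max(new_rows, key=lambda r: datetime.fromisoformat(r["updated_at"]))
        table[key]["last_run_timestamp"] = newest["updated_at"]
    return new_rows


first_run = run_incremental(watermark_table, source_rows)
second_run = run_incremental(watermark_table, source_rows)
print([r["order_id"] for r in first_run], [r["order_id"] for r in second_run])
# [2, 3] []
```

The second run picks up nothing because the watermark has already moved past every row, which is exactly the behaviour you want from a nightly job.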

Other pipeline uses for Table Storage:

Pipeline run metadata (status, row counts, duration)
Configuration values shared across pipeline activities
Simple lookup tables for reference data enrichment

File Storage: A Quick Note

Azure File Storage provides a managed SMB file share in the cloud, the kind you mount as a network drive in Windows (\\server\share).

For data engineering pipelines, you'll rarely reach for File Storage. It's primarily useful for lift-and-shift migrations, moving on-premises applications to Azure when those applications expect to read from a network file share and you don't want to refactor them.

If you're building a new pipeline from scratch, ADLS Gen2 is almost always the right choice over File Storage for analytics workloads.

ADLS Gen2 vs Plain Blob Storage — When to Use Which

Scenario	Use
Raw file landing zone for a big data pipeline	ADLS Gen2
Serving images or videos to a web application	Blob Storage
VM disk backups or snapshots	Blob Storage
Spark / Databricks / Synapse analytics workloads	ADLS Gen2
Bronze / Silver / Gold Medallion layers	ADLS Gen2
Simple static file hosting	Blob Storage
ML training datasets and model artifacts	ADLS Gen2
Microsoft Fabric / OneLake backend	ADLS Gen2

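Pricing is close but not identical: accounts with the hierarchical namespace enabled have their own rates for storage and transactions, so check the pricing page for your region. The practical difference is in the hierarchical namespace and the performance characteristics it unlocks for analytics.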

The Full Picture: One Pipeline, All Four Storage Types

Here's how everything we've covered fits into a single, real data engineering pipeline — the kind you'd actually build in Azure:

REST API (sales data source)
        ↓
Azure Data Factory (orchestration)
        ↓ writes raw Parquet
ADLS Gen2 — bronze/sales/2024/
        ↓
Azure Databricks (Spark: clean, deduplicate, validate)
        ↓ writes Delta tables
ADLS Gen2 — silver/sales/2024/
        ↓
Azure Databricks (Spark: aggregate, calculate metrics)
        ↓ writes business-ready Delta tables
ADLS Gen2 — gold/sales/2024/
        ↓
Power BI (DirectLake mode — no import, always current)
        ↓
Business dashboard

Supporting roles:
├── Queue Storage → ADF pipeline triggered by file arrival event
└── Table Storage → watermark ("last ingested: 2024-01-15T02:00:00Z")

Every storage type has one job. None of them overlap. And ADLS Gen2 is the spine the whole thing runs on.

The Decision Guide: One Question at a Time

When you're building a pipeline and need to decide where something lives, ask these questions in order:

Is it a file that a Spark job or analytics tool needs to read? → ADLS Gen2

Is it a file served to end users (images, videos, downloads)? → Blob Storage

Is it a message that needs to trigger something downstream? → Queue Storage

Is it small structured data - a config value, a watermark, a metadata record? → Table Storage

Is it a file share that a VM or legacy app needs to mount over SMB? → File Storage
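
The same questions, in the same order, can be written down as a small function. This is only the checklist above restated in code; the function and parameter names are mine, and real decisions will have more nuance than five booleans.

```python
def choose_storage(
    read_by_analytics=False,
    served_to_end_users=False,
    is_message=False,
    small_structured=False,
    needs_smb_share=False,
):
    """Walk the questions in order and return the first match."""
    if read_by_analytics:
        return "ADLS Gen2"
    if served_to_end_users:
        return "Blob Storage"
    if is_message:
        return "Queue Storage"
    if small_structured:
        return "Table Storage"
    if needs_smb_share:
        return "File Storage"
    return "undecided"


print(choose_storage(read_by_analytics=True))  # ADLS Gen2
print(choose_storage(small_structured=True))   # Table Storage
```

Because the checks run in order, a file that is both read by Spark and served to users lands in ADLS Gen2, which matches the priority of the list.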

The Key Lessons

1. Azure storage is four different things. Each one has a specific job. Using the wrong one is a surprisingly easy mistake to make on day one and a frustrating one to debug.

2. ADLS Gen2 is Blob Storage with one upgrade that changes everything. The hierarchical namespace turns flat object storage into a real file system. That single feature is why every serious Azure analytics service is built on top of it.

3. ADLS Gen2 is the Bronze/Silver/Gold spine of Medallion Architecture. The layers aren't abstract concepts; they're real directories in a container, with Spark jobs and ADF pipelines connecting them.

4. Queue and Table Storage are the glue. They're not glamorous, but production pipelines depend on them for event triggering and state management.

5. OneLake is ADLS Gen2. When you use Microsoft Fabric, you're using ADLS Gen2 underneath. Understanding the storage layer means you understand what every Azure data platform is actually built on.

About the Author

I'm Aditya Singh Rathore, a Data Engineer passionate about building modern, scalable data platforms on Azure. I write about data engineering, cloud architecture, and real-world pipelines on RecodeHive — breaking down complex concepts into things you can actually use.

📩 Building something on Azure and stuck on storage decisions? Drop your question in the comments.

Azure Synapse Analytics: When to Use It (And When to Choose Fabric Instead)

rathoreadityasingh30@gmail.com (Aditya Singh Rathore) — Wed, 06 May 2026 00:00:00 GMT

When I first started working seriously with Azure, Synapse was the answer to almost every data question.

Need a SQL warehouse? Synapse. Need Spark for big data? Synapse. Need pipelines to move data? Synapse. Need to query files sitting in ADLS Gen2 without loading them anywhere? Synapse.

It was genuinely impressive: one workspace that brought together SQL, Spark, pipelines, and storage into a single studio. I built three production pipelines on it, and they worked well.

Then Microsoft Fabric arrived.

And now the question I get asked most often is: "Should I still use Synapse, or should I move to Fabric?"

The honest answer is: it depends on where you are in your Azure journey. This blog gives you the full picture: what Synapse actually is, when it's the right call, when Fabric is the better choice, and how to think about the transition if you're already on Synapse.

What Azure Synapse Analytics Actually Is

Azure Synapse Analytics started as the next step beyond Azure SQL Data Warehouse, but over time it evolved into a much broader analytics platform rather than remaining just a cloud data warehouse solution.

What changed significantly was the addition of multiple processing engines and integrated tooling within a single workspace. Instead of working only with SQL-based warehousing, teams could now combine:

large-scale Spark processing
SQL analytics
real-time exploration capabilities
orchestration pipelines
integrated data lake access

This shift made Synapse more of a unified analytics ecosystem on Azure, where data engineering, big data processing, and reporting workloads could coexist within the same platform experience.

One of the biggest differences compared to the earlier SQL Data Warehouse model is that Synapse tries to reduce the fragmentation between storage, transformation, orchestration, and analytics services that previously had to be managed separately.

In plain terms: it's a unified analytics platform that brings together four things that used to require four separate Azure services:

SQL analytics - for querying structured data at scale
Apache Spark - for big data processing, ML, and complex transformations
Data integration (Synapse Pipelines) - for moving and transforming data across systems
A unified workspace (Synapse Studio) - where all of the above live together

The key architectural principle underneath all of this is the separation of compute and storage. This decoupling allows organizations to scale their processing power independently of their data volume. Compute resources can be ramped up to handle peak query loads and then scaled down or even paused during periods of inactivity, all without affecting the underlying data stored in ADLS Gen2.

That's a big deal in practice. You pay for compute only when you use it.

The Four Core Components - What Each One Does

1. Dedicated SQL Pools: High-Performance Data Warehousing

Dedicated SQL Pools are Synapse's data warehousing engine. You provision a fixed amount of compute capacity, measured in Data Warehouse Units (DWUs), and in return you get consistent, predictable query performance for production workloads, scheduled reports, and dashboards.

This is the right choice when:

You have large, structured datasets that are queried repeatedly by BI tools
You need consistent sub-second query performance for dashboards
Your team works primarily in T-SQL
You're migrating from an on-premises SQL Server or Oracle data warehouse

The trade-off: you pay for the provisioned DWUs whether you're running queries or not. It's expensive to leave a Dedicated SQL Pool running 24/7 for workloads that only query it during business hours.

The practical fix: pause your Dedicated SQL Pool outside business hours. Synapse lets you do this programmatically via Azure Automation or ADF pipelines — you only pay for compute when it's actually running.
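
A quick back-of-the-envelope calculation shows why pausing matters. The numbers below are assumptions I picked for the sketch (a 30-day month, 22 working days, a 10-hour business day); plug in your own schedule. It only models compute hours, since storage is still billed while a pool is paused.

```python
# Assumed schedule; adjust to your own workload.
DAYS_IN_MONTH = 30
WORKING_DAYS = 22
BUSINESS_HOURS_PER_DAY = 10

always_on_hours = DAYS_IN_MONTH * 24
business_hours = WORKING_DAYS * BUSINESS_HOURS_PER_DAY

# Share of compute hours avoided by pausing outside business hours.
savings_pct = round((1 - business_hours / always_on_hours) * 100, 1)

print(always_on_hours, business_hours, savings_pct)  # 720 220 69.4
```

Under these assumptions the pool is only needed for 220 of the month's 720 hours, so pausing removes roughly two thirds of the compute bill regardless of the DWU level you run at.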

2. Serverless SQL Pool: Query Without Loading

Serverless SQL Pool is probably one of the most practical and underrated capabilities inside Azure Synapse.

What makes it interesting is how quickly you can start querying data directly from your data lake without provisioning dedicated infrastructure upfront. Instead of maintaining a constantly running cluster, the engine dynamically allocates compute only when a query is executed.

Under the hood, queries are distributed across multiple compute resources and processed in parallel, which makes it surprisingly efficient for exploratory analysis and lightweight analytical workloads.

The pricing model is also very different from traditional warehouses. Since billing is based on the amount of data scanned per query, it works particularly well for:

ad-hoc analysis
one-time investigations
querying historical files
lightweight reporting workloads
infrequently accessed datasets

The first time I used it, the biggest surprise was how quickly I could run SQL directly on files sitting in ADLS without setting up ingestion pipelines or persistent compute.

In practice: you can write a SQL query directly against Parquet, CSV, or Delta files sitting in ADLS Gen2 without loading them into any database first.

-- Query a Parquet file in ADLS Gen2 directly — no loading required
SELECT
    region,
    SUM(amount) AS total_revenue,
    COUNT(order_id) AS total_orders
FROM
    OPENROWSET(
        BULK 'https://mylake.dfs.core.windows.net/silver/sales/2024/**',
        FORMAT = 'PARQUET'
    ) AS sales_data
GROUP BY region
ORDER BY total_revenue DESC;

You pay for the bytes scanned by that query. Nothing more.
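
To get a feel for what "bytes scanned" means for your bill, here is a small estimator. Treat every constant as an assumption to verify against the current pricing page: the per-terabyte rate, the per-query minimum, and the 1024-based terabyte are placeholders I chose for illustration, not quoted prices.

```python
BYTES_PER_MB = 1024 ** 2
BYTES_PER_TB = 1024 ** 4


def estimate_query_cost(bytes_scanned, price_per_tb=5.0, min_billed_bytes=10 * BYTES_PER_MB):
    """Estimate the cost of one query from the bytes it processes.

    price_per_tb and min_billed_bytes are assumed values; check current pricing.
    """
    billed_bytes = max(bytes_scanned, min_billed_bytes)
    return billed_bytes / BYTES_PER_TB * price_per_tb


# A query that scans 50 GB of Parquet.
fifty_gb = 50 * 1024 ** 3
print(round(estimate_query_cost(fifty_gb), 4))  # 0.2441
```

The practical lesson is that anything reducing the bytes read, such as columnar formats, partition folders, and selecting only the columns you need, reduces the cost directly.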

This is the right choice when:

You need to explore raw data in ADLS Gen2 before deciding how to model it
You have analysts who know SQL but don't want to write Spark code
You're running occasional ad-hoc queries that don't justify provisioning a dedicated warehouse
You want to build a logical data warehouse on top of your data lake without moving data

3. Apache Spark Pools: Big Data and ML Workloads

Azure Synapse Analytics includes deeply integrated Apache Spark capabilities, allowing teams to work with large-scale data processing directly within the Synapse workspace instead of managing separate big data platforms.

Spark Pools provide a managed Spark environment where engineers and data scientists can build ETL pipelines, prepare large datasets, process semi-structured or unstructured data, and develop machine learning workflows using familiar notebook-based development.

One thing I found particularly useful is that infrastructure management is mostly abstracted away. You can write notebooks using Python, Scala, SQL, or R while Synapse handles much of the operational overhead like cluster provisioning, scaling, and session management behind the scenes.

This makes Spark Pools especially practical for workloads that go beyond traditional SQL transformations and require distributed computation at scale.

This is the right choice when:

Your transformations are too complex for SQL alone
You're building ML pipelines or training models on large datasets
You need to process semi-structured data (JSON, nested arrays) at scale
Your data engineering team is comfortable in PySpark or Scala

The key advantage over standalone Spark clusters: Spark Pools share the same workspace as your SQL Pools and Pipelines. A Spark notebook can write a Delta table that a SQL analyst can immediately query without any data movement or cross-service configuration.

4. Synapse Pipelines: Data Integration and Orchestration

Synapse Pipelines is the data integration layer. It uses the same engine as Azure Data Factory, so teams already using ADF will find it immediately familiar: the same visual, activity-based orchestration tool with 95+ connectors to external systems, built directly into the Synapse workspace. Pipelines handle the movement and transformation of data across systems: connecting to sources, extracting data, applying transformations, and loading results into destinations.

The advantage over standalone ADF: your pipelines live in the same workspace as your SQL and Spark workloads. You can trigger a Spark notebook, run a SQL script, and copy data to ADLS Gen2, all within a single pipeline, without leaving Synapse Studio.

What Synapse Studio Actually Looks Like

Synapse Studio is the unified web-based interface that ties everything together. From one interface, teams can write and execute SQL queries against data warehouse tables, build and run Apache Spark notebooks, design data pipelines using visual drag-and-drop tools, monitor jobs, manage resources, and configure security settings. Data engineers building pipelines and analysts writing reports work in the same environment with access to the same underlying data.

In practice, this means less context-switching. When I was building pipelines on Synapse, the biggest quality-of-life win was being able to debug a Spark notebook, run a SQL query against its output, and check the pipeline that triggered it, all in the same browser tab.

Real-World Use Cases - When Synapse Is the Right Call

Use Case 1: Enterprise Data Warehouse Migration

Organizations moving from on-premises data warehouses like SQL Server or Oracle to Azure Synapse benefit from enhanced scalability, cost savings, and better performance.

If your team is deeply invested in T-SQL, has existing stored procedures and reporting logic, and is migrating from SQL Server or Azure SQL DW — Synapse's Dedicated SQL Pool is the most natural landing spot. The syntax is familiar, the tooling is mature, and the migration path is well-documented.

Use Case 2: Ad-Hoc Exploration on a Data Lake

You've landed months of raw data in ADLS Gen2 and need to understand what's in it before building a formal pipeline. Serverless SQL Pool lets analysts write SQL against those files immediately without waiting for a data engineer to model the data first.

This is genuinely one of Synapse's strongest differentiators. Few other Azure services let SQL analysts query raw Parquet files on a data lake this directly or this cheaply.

Use Case 3: Mixed SQL + Spark Workloads

Your team has SQL analysts querying a data warehouse and data engineers running Spark transformation jobs. In most stacks, these two groups work in separate tools with separate data copies.

In Synapse, Spark can write a Delta table that the SQL pool reads, and SQL results can feed back into Spark notebooks without data movement between services. Both groups work against the same underlying data in ADLS Gen2.

Use Case 4: Regulated Industries Requiring Network Isolation

Synapse has mature support for managed virtual networks and private endpoints. For teams in finance, healthcare, or government, where strict data residency and network isolation are non-negotiable, those controls are a significant advantage over Fabric, whose networking story is still evolving.

Synapse vs Fabric: The Honest Comparison

Azure Synapse Analytics is a platform-as-a-service (PaaS) solution that provides modular components giving fine-grained control over data workflows. Microsoft Fabric represents a software-as-a-service (SaaS) approach bringing everything together into a single unified platform with shared governance, compute, and storage through OneLake.

Dimension	Azure Synapse	Microsoft Fabric
Deployment model	PaaS - you manage compute resources	SaaS - fully managed
Storage	ADLS Gen2 (you manage)	OneLake (unified, managed for you)
SQL engine	Dedicated + Serverless SQL Pools	Fabric Warehouse + SQL analytics endpoint
Spark	Apache Spark Pools	Fabric Spark (same engine, newer experience)
Pipelines	Synapse Pipelines (ADF engine)	Fabric Data Factory (next-gen ADF)
Real-time	Data Explorer (partially retired)	Eventstreams + Eventhouse (KQL)
Network isolation	Mature - managed VNet, private endpoints	Still evolving
T-SQL support	Full	Some gaps (OPENROWSET and others)
AI / Copilot	Limited	Built-in Copilot across all workloads
Direction	Maintenance mode	Active investment - new features land here first
Best for	Existing investments, regulated industries, SQL-heavy teams	Greenfield projects, unified analytics, AI workloads

Should You Migrate from Synapse to Fabric?

If you're already on Synapse, here's the pragmatic framework:

Migrate these workloads to Fabric now:

Spark-based data engineering notebooks and jobs
Synapse Pipelines (the migration assistant handles most of this automatically)
Real-time analytics workloads (Fabric's Eventhouse is better than Data Explorer)
Power BI-connected workloads (DirectLake mode is a significant upgrade)

Keep these on Synapse for now:

Workloads that depend heavily on Dedicated SQL Pool features
Pipelines that require complex network isolation or private endpoints
Anything using features that don't have a Fabric equivalent yet (OPENROWSET, Synapse Link for some sources)

A phased approach works best: migrate greenfield workloads to Fabric immediately, then build a roadmap for existing Synapse workloads as Fabric's feature gaps close.

The good news: the migration assistant automatically migrates core Spark artifacts from Azure Synapse Analytics into Fabric Data Engineering, bringing over Spark pools, notebooks, and Spark job definitions with no data moved during the process.

The Key Lessons

1. Synapse is not dead, but it's not the future either. It's a fully supported, production-ready platform that will be around for years. But Microsoft's innovation is going into Fabric, not Synapse.

2. Serverless SQL Pool is genuinely underrated. The ability to query raw files in ADLS Gen2 with SQL, paying only for bytes scanned, is one of the most cost-efficient features in the entire Azure data stack. Even if you move to Fabric, this pattern is worth understanding.

3. For greenfield projects in 2026, start with Fabric. The OneLake architecture, the unified experience, and the Copilot integration make it the better starting point for anything new.

4. For existing Synapse investments, migrate in phases. Don't rush a full migration. Move Spark workloads and pipelines first. Evaluate Dedicated SQL Pool workloads carefully before touching them.

5. The separation of compute and storage matters. Whether you're on Synapse or Fabric, the underlying principle is the same: your data lives in ADLS Gen2 / OneLake, and your compute scales independently. Understanding this makes both platforms easier to reason about.

📩 Still on Synapse and thinking about Fabric? Drop your questions in the comments, happy to help you think through the migration.

Why We Rolled Back Our Kafka Pipeline to Batch After 6 Months

rathoreadityasingh30@gmail.com (Aditya Singh Rathore) — Wed, 06 May 2026 00:00:00 GMT

Everyone in data engineering is obsessed with real time.

Kafka. Flink. Event-driven architectures. Millisecond latency. Live dashboards. It's the direction every conference talk points, every job description asks for, every architecture diagram proudly features.

And I bought into it completely.

About a year into my data engineering career, our product team came to us with a request: customers wanted to see their order status update in real time. Our existing batch pipeline ran at 2am every night, and customers were calling support asking where their orders were.

Reasonable ask. So we rebuilt the pipeline as a streaming system.

Six months later, I had learned more about the real cost of streaming than any blog post or conference talk had ever prepared me for.

This is that story — and the honest breakdown I wish someone had given me before I started.

What We Had Before (And Why It Worked)

Our original order pipeline was batch. It ran every night at 2am via Azure Data Factory, pulled 24 hours of orders from our SQL database, ran a Spark transformation job, and wrote clean Delta tables to ADLS Gen2.

Every night at 2am:
    ↓
ADF Pipeline triggers
    ↓
Pull all orders from the last 24 hours
    ↓
Spark: clean → deduplicate → join product catalog
    ↓
Write to Silver layer (Delta table on ADLS Gen2)
    ↓
Aggregate into Gold layer
    ↓
Power BI refreshes — customers see updated status

It ran in 45 minutes. Our Spark cluster spun up, did its job, and shut down. We paid for 45 minutes of compute per day. The pipeline was simple, debuggable, and recoverable: if something broke, we fixed it and replayed from Bronze.

The only problem: customers saw data that was 6 to 30 hours old depending on when they ordered.

For most use cases, that's fine. For order status, it wasn't.

Hidden Cost #1 - Infrastructure That Never Sleeps

The first thing that surprised me about our streaming pipeline was the infrastructure bill.

Our batch Spark cluster ran 45 minutes a day. Our Kafka + Flink setup ran every minute of every day, 24 hours a day, 7 days a week, whether there were 10 events per second or 10,000.

Streaming infrastructure requires 24/7 uptime. You can't spin it down overnight to save money. You can't schedule it during off-peak hours. The pipeline is always on, always consuming resources, always incurring cost.

For our team, the monthly compute cost for the streaming pipeline was roughly 4x what the equivalent batch job cost, and that was before accounting for the additional engineering time to maintain it.

The question to ask before going streaming: Is the business value of real-time data worth 4x the infrastructure cost? Sometimes the answer is yes. Often it isn't.
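
It helps to put the two figures from this section side by side. The arithmetic below uses only the numbers already quoted (a 45-minute nightly batch run, an always-on streaming stack, and a bill roughly 4x higher); the "implied rate" is my own inference from those figures, not a measured value.

```python
BATCH_HOURS_PER_DAY = 45 / 60   # one 45-minute run per night
STREAMING_HOURS_PER_DAY = 24    # always on
COST_RATIO = 4                  # streaming bill was roughly 4x the batch bill

# How many more compute hours the streaming stack clocks per day.
hours_ratio = STREAMING_HOURS_PER_DAY / BATCH_HOURS_PER_DAY

# Implied hourly cost of streaming relative to the batch cluster.
implied_hourly_rate_ratio = COST_RATIO / hours_ratio

print(hours_ratio, implied_hourly_rate_ratio)  # 32.0 0.125
```

In other words, the streaming stack was running 32 times as many hours. The bill came out at 4x rather than 32x, which suggests the always-on footprint was much smaller per hour than the batch cluster, and it still cost four times as much.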

Hidden Cost #2 - Late-Arriving Data Will Break Your Logic

In a batch pipeline, late data is not a problem. If an event arrives 3 hours late, it's in the next batch. The pipeline processes it, life goes on.

In a streaming pipeline, late-arriving data is one of the hardest problems in distributed systems.

Events can arrive out of order due to network delays, retries, or clock skew between services. Your Flink job is processing event #1,000 when event #987 suddenly arrives 45 seconds late. What do you do?

The answer involves watermarking, telling your stream processor "wait X seconds after the event time before closing a window, to account for late arrivals." But choosing the right watermark is a balance:

Too short: you miss late-arriving events, your aggregations are wrong
Too long: you hold state in memory longer, increasing latency and memory pressure

We got this wrong twice before landing on a configuration that worked. Both times, our order counts were silently off by 1-3%, small enough to look like noise, large enough to cause problems in financial reconciliation.

Late data problem illustrated:

Event time:  10:00  10:01  10:02  10:03  10:04
Arrived at:  10:00  10:01  10:04  10:03  10:05
                            ↑
                    event #3 arrived 2 minutes late
                    — already missed the 10:02 window
                    — your aggregate is wrong

In batch, this doesn't exist as a problem. In streaming, it's a constant engineering challenge.
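
The illustration above is easy to reproduce in a few lines. This is a toy simulation in plain Python, not Flink's API: it counts events in one-minute event-time windows, derives a watermark as "latest event time seen minus allowed lateness", and drops any event whose window has already closed. Times are minutes after 10:00, and the function name is made up.

```python
from collections import Counter

# (event_minute, arrival_minute), taken from the illustration above.
EVENTS = [(0, 0), (1, 1), (2, 4), (3, 3), (4, 5)]


def count_per_window(events, allowed_lateness):
    """Count events per one-minute window, dropping those that arrive too late."""
    counts = Counter()
    dropped = []
    max_event_time = float("-inf")
    # Process events in the order they arrive, not the order they happened.
    for event_time, _arrival in sorted(events, key=lambda e: e[1]):
        watermark = max_event_time - allowed_lateness
        window_end = event_time + 1
        if window_end <= watermark:
            dropped.append(event_time)  # its window has already closed
        else:
            counts[event_time] += 1
        max_event_time = max(max_event_time, event_time)
    return dict(counts), dropped


strict_counts, strict_dropped = count_per_window(EVENTS, allowed_lateness=0)
relaxed_counts, relaxed_dropped = count_per_window(EVENTS, allowed_lateness=2)
print(strict_dropped, relaxed_dropped)  # [2] []
```

With no allowed lateness, the 10:02 event is silently lost and the 10:02 window reports zero. With two minutes of slack it is counted, at the price of holding every window open two minutes longer.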

Hidden Cost #3 - Exactly-Once Is Harder Than It Sounds

Handling failures in batch pipelines is usually predictable.
If a batch job fails, you typically resolve the issue and rerun the pipeline from the beginning. Since the processing happens on bounded data, recovery is relatively straightforward.

Streaming systems work very differently.

In platforms like Kafka and Flink, data is continuously flowing through the system. If a streaming job crashes midway through processing, recovery becomes much more complex than simply restarting the job.

For example, after recovery:

Should previously processed events be replayed?
Could some records get skipped unintentionally?
Is there a possibility that certain events are processed more than once?

This challenge is commonly addressed through exactly-once processing guarantees, where the goal is to ensure that every event affects the system exactly one time even during failures and restarts.

Achieving reliable exactly-once behavior usually depends on several components working together correctly:

Proper Kafka offset management
Reliable Flink checkpointing and state recovery
Idempotent writes to downstream systems
Consistent state synchronization during failover scenarios

In practice, recovery bugs in streaming systems can have real operational impact. A single restart issue can lead to duplicate event processing, inconsistent downstream data, repeated customer notifications, or inaccurate analytics until the state is corrected.

Unlike batch systems, where failures often leave datasets untouched until rerun, streaming failures can leave systems in partially updated states that are significantly harder to debug and recover from.
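
To see why idempotent writes matter, here is a minimal sketch of a job that crashes after processing four events but had only checkpointed the first two, so the restart replays the last two. Everything here is hypothetical (`NaiveSink`, `IdempotentSink`, `run_with_crash`); it stands in for a Flink job and a downstream store rather than using either.

```python
class NaiveSink:
    """Applies every write as an increment, so replayed events double-count."""

    def __init__(self):
        self.total = 0

    def write(self, event_id, amount):
        self.total += amount


class IdempotentSink:
    """Upserts by event_id, so writing the same event twice changes nothing."""

    def __init__(self):
        self.amount_by_event = {}

    def write(self, event_id, amount):
        self.amount_by_event[event_id] = amount

    @property
    def total(self):
        return sum(self.amount_by_event.values())


def run_with_crash(sink, events, last_checkpoint):
    """Process all events, 'crash', then replay from the last checkpointed offset."""
    for event_id, amount in events:                    # first attempt
        sink.write(event_id, amount)
    for event_id, amount in events[last_checkpoint:]:  # replay after restart
        sink.write(event_id, amount)
    return sink.total


EVENTS = [("e1", 10), ("e2", 20), ("e3", 30), ("e4", 40)]

naive_total = run_with_crash(NaiveSink(), EVENTS, last_checkpoint=2)
idempotent_total = run_with_crash(IdempotentSink(), EVENTS, last_checkpoint=2)
```

The true total is 100. The naive sink reports 170 because events e3 and e4 are applied twice, while the idempotent sink still reports 100 after the replay.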

Hidden Cost #4 - Testing Is a Different Discipline

Testing a batch pipeline is relatively straightforward. You have a dataset, you run the transformation, you check the output. Deterministic, reproducible, easy to validate.

Testing a streaming pipeline requires simulating event streams with realistic timing, ordering, and volume. You need to test:

What happens when events arrive out of order?
What happens when a consumer crashes and restarts?
What happens when Kafka lag builds up during a traffic spike?
What happens when an upstream service sends a malformed event?

We discovered most of our edge cases in production, not in testing. Not because we were careless, but because accurately simulating a live event stream in a test environment is genuinely difficult.

Our batch pipeline had a test suite that ran in 8 minutes. Our streaming pipeline's test suite took 40 minutes and still missed three production bugs in the first month.

Hidden Cost #5 - Your Team Needs Streaming-Specific Skills

This one is easy to underestimate.

Batch data engineering skills (Spark, SQL, dbt, ADF) are well understood, well documented, and widely held. If someone on your team leaves, finding a replacement with those skills is manageable.

Streaming-specific skills (Kafka internals, Flink state management, watermarking strategies, consumer group management, exactly-once configuration) are genuinely harder to find and take longer to develop.

When we hit our first major Flink issue (a state backend misconfiguration causing memory pressure under load), our team spent three days debugging something that an experienced Flink engineer would have spotted in 20 minutes. We didn't have one. We learned on the job, which is fine, but it was expensive learning.

Before committing to a streaming architecture, ask: does your team have the skills to maintain it? And if not, what's the cost of developing those skills or hiring them?

So When Is Streaming Actually Worth It?

None of this means streaming is wrong. It means streaming has a real cost that should be weighed against a real business need.

Streaming is worth it when the business problem genuinely cannot tolerate batch latency. Here's a clear test:

Reach for streaming when:

Fraud needs to be detected before a transaction completes — batch latency means the fraud already happened
A customer's app needs to reflect a change within seconds of it occurring
A system needs to react to an event automatically — alerts, triggers, automated responses
You're processing IoT sensor data where stale readings are dangerous, not just inconvenient

Stick with batch when:

You're building monthly reports, financial summaries, or historical analyses
Your stakeholders check dashboards in the morning, not the second
Your transformations involve complex aggregations over large historical datasets
Your team is small and operational simplicity matters more than latency

The tech industry is currently obsessed with "real-time," which has led many organizations to over-engineer their stacks, implementing complex stream-processing frameworks where a simple batch job would have sufficed. A well-built batch pipeline is more reliable, cheaper, and easier to maintain than a poorly justified streaming one.

The Architecture That Actually Works: Both

Here's what I'd tell myself before starting that project:

You probably need both, not either/or.

Our final architecture uses batch for everything that can tolerate it, and streaming only for the specific cases that genuinely can't:

Streaming layer (Kafka + Flink):
    Order events → real-time status updates (Cassandra)
    Fraud signals → real-time alerts (notification service)

Batch layer (Spark + ADF):
    Nightly order aggregations → Silver → Gold (Power BI)
    Monthly revenue reports (finance team)
    ML training datasets (data science team)

The streaming layer handles the 5% of use cases where seconds matter. The batch layer handles the 95% where they don't, and does so more reliably, more cheaply, and with less operational overhead.

Microsoft Fabric is built around exactly this pattern: Eventstreams for real-time ingestion, ADF Pipelines and Spark Notebooks for batch transformation, both writing to the same OneLake. You don't have to choose one architecture. You choose the right tool for each use case within the same platform.

The Honest Summary

	Batch	Streaming
Infrastructure cost	Low - runs on schedule	High - always on
Latency	Minutes to hours	Milliseconds to seconds
Late data	Not a problem	Significant engineering challenge
Failure recovery	Fix and rerun	Complex - risk of duplicates or data loss
Testing	Straightforward	Requires stream simulation
Team skills needed	Spark, SQL, ADF	Kafka, Flink, state management
Best for	Analytics, reporting, ML	Fraud detection, live status, alerts
Operational complexity	Low	High

Streaming pipelines are powerful. They enable product experiences that batch simply can't deliver.

But they come with real costs - infrastructure that never sleeps, late-data handling that never stops being tricky, failure recovery that's genuinely hard to get right, and a skills requirement that's easy to underestimate.

The next time someone on your team says "we should make this real time", ask the question first:

How long can the business actually wait for this data?

If the honest answer is "overnight is fine" — keep the batch job. It's not boring. It's the right call.

About the Author

I'm Aditya Singh Rathore, a Data Engineer passionate about building modern, scalable data platforms. I write about data engineering, Azure, and real-world pipeline design on RecodeHive, turning hard-won lessons into content anyone can learn from.

🔗 LinkedIn | GitHub

📩 Have you been burned by a streaming pipeline that didn't need to be? Drop it in the comments.

How Netflix Handles 2 Trillion Events Every Day

rathoreadityasingh30@gmail.com (Aditya Singh Rathore) — Mon, 04 May 2026 00:00:00 GMT

Right now, someone is pausing Stranger Things at the exact moment a jump scare hits.

Someone else just searched "action movies" and clicked the third result. Another person skipped the intro of a show they've watched five times. And somewhere, a user on a slow connection just had their video quality automatically drop from 4K to 1080p, without any buffering, without any prompt.

Every single one of these actions is an event. And Netflix captures all of them from 300 million subscribers across 190 countries, continuously, in real time.

The scale: 2 trillion events every single day. That's 3 petabytes of data ingested, 7 petabytes output, at a peak rate of 12.5 million events per second. The system behind all of this is called Keystone - Netflix's internal real-time data pipeline, and understanding how it works is one of the most instructive case studies in modern data engineering.

The Scale Problem: Why This Is Actually Hard

Most people assume Netflix's hard problem is streaming video. It's not. The hard problem is streaming data about video.

Every time you interact with Netflix, dozens of microservices each emit their own events simultaneously. A single "press play" triggers events from the playback service, the recommendation service, the quality-monitoring service, the CDN routing service, and more, all at the same time. Now multiply that by 300 million subscribers across different time zones.

Before Keystone, Netflix ran a batch pipeline built on Chukwa, Hadoop, and Hive. By 2015, logging volume had grown to 500 billion events per day and the system was collapsing. Netflix estimated they had six months to rebuild it as a streaming-first architecture before it failed completely under subscriber growth.

That pressure is why every architectural decision in Keystone was made under real production constraints, not theoretical design.

Keystone processes 2 trillion events/day — 3PB ingested, 7PB output daily. Source: Netflix Engineering

What Is an Event, Exactly?

An event is a small structured record, typically a few kilobytes, that captures a single thing that happened. Every event at Netflix carries a consistent set of core fields:

{
  "event_id":   "uuid-1234-abcd",
  "event_type": "play_start",
  "user_id":    "u_98765432",
  "device_id":  "d_iPhone15",
  "title_id":   "t_StrangerThings_S4E1",
  "timestamp":  "2026-05-04T18:32:11.452Z",
  "session_id": "s_abc123",
  "region":     "IN",
  "quality":    "1080p",
  "network":    "WiFi"
}

Netflix generates hundreds of distinct event types across all its services:

play_start, play_pause, play_stop, seek
search_query, search_result_click
scroll_position, title_hovered, row_impression
buffer_start, buffer_end, quality_change
error_occurred, playback_failed
ab_test_assignment, recommendation_shown

Each event type has its own schema, its own set of required and optional fields, data types, and validation rules. Managing thousands of schemas across hundreds of microservice teams is itself a major engineering problem. That's exactly what the Schema Registry (covered below) was built to solve.

The event above looks simple. But when you're ingesting 12.5 million of them every second, the engineering required to make that reliable (no data loss, no duplicates, no schema corruption) is anything but simple.

The Architecture: Keystone, Kafka, and Flink

Keystone: The Platform That Wraps Everything

Most articles jump straight to Kafka and Flink. But the important thing to understand first is Keystone, the internal platform that manages the entire pipeline as a service.

Keystone is not a single open-source tool. It's Netflix's purpose-built Stream Processing as a Service (SPaaS) platform built on top of Kafka and Flink. It provides:

A Data Pipeline layer: handles event ingestion, routing, and delivery to all downstream sinks (S3, Elasticsearch, secondary Kafka topics)
A Stream Processing layer: lets any Netflix engineering team deploy and run custom Flink jobs without managing the underlying infrastructure themselves
A Control Plane: manages job configuration, deployment via Spinnaker, health monitoring, and self-healing. Every job's desired state is stored in AWS RDS; if a Kafka cluster goes down, it can be fully reconstructed from the state held in RDS alone

Think of Keystone as the operating system for data at Netflix. Kafka and Flink are the engines. Keystone is the layer that makes them usable, self-service, and reliable across thousands of internal teams.

📖 Keystone Real-time Stream Processing Platform — Netflix Tech Blog

The full pipeline architecture:

Layer 1 - Event Capture: Suro and the API Gateway

When a Netflix microservice emits an event, it has two paths into Kafka:

Direct Kafka write via a Java client library, for high-throughput services that need maximum speed
HTTP POST via Suro, Netflix's internal event collection proxy, for services in Python or other languages

Both paths end at the same place: a Kafka topic. The critical design principle here is capture first, process never at the entry point. The gateway does minimal validation (is the schema registered? does the payload match?) and then writes immediately. No enrichment, no business logic, no database calls.

At 12.5 million events per second, even a 1-millisecond database call per event would mean roughly 12,500 database calls in flight at any given moment at the gateway alone. Keeping the entry point stateless is what makes the pipeline scale.
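
The estimate above follows from Little's law: the average number of requests in flight equals the arrival rate multiplied by the time each request takes. A quick check with the figures from the text (the function name is illustrative):

```python
def in_flight_calls(events_per_second, call_latency_ms):
    """Little's law: average concurrency = arrival rate x time per call."""
    return events_per_second * call_latency_ms / 1000


# 12.5 million events per second, 1 ms per database call.
concurrent_calls = in_flight_calls(12_500_000, 1)
```

That works out to 12,500 calls open at once, and a slower call scales it linearly: a 10 ms lookup would mean 125,000 in flight.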

Layer 2 - Apache Kafka: The Heart of the Pipeline

Apache Kafka is the backbone of Keystone. Every event from every microservice flows through Kafka before going anywhere else.

Topic-per-event-type architecture:

Netflix follows a strict rule: one Kafka topic per event type. Hundreds of topics run in parallel — play_events, search_events, error_events, quality_events, and so on. This isolation means a spike in error events during an outage doesn't slow down play event processing, and each topic can have its own retention policy, replication factor, and partition count independently tuned.

Durability profiles:

Netflix configures Kafka with different durability levels depending on how critical the data is. For AP (availability over consistency) use cases, such as analytics events where losing a tiny fraction is acceptable, they allow unclean leader election, trading perfect consistency for staying available. For CP (consistency over availability) use cases, such as billing events and legal audit logs, they require clean leader election so that acknowledged data is not lost.

Avro + Schema Registry - the data contract:

Every event in Kafka is encoded in Apache Avro, a compact binary format that is 3-5x smaller than JSON and significantly faster to parse. But more importantly, every Avro schema is registered in a centralised Schema Registry before any event can be written.

When a team deploys a bad change that sends a malformed event (wrong field type, missing required field), the event is rejected at the producer, before it is written to Kafka. It never enters the pipeline. At 2 trillion events per day, an undetected schema mismatch could corrupt petabytes of downstream data before anyone notices. Schema enforcement at the source is what prevents this.
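
As a rough illustration of rejecting bad events at the producer, the sketch below validates an event against a schema before "publishing" it. It is deliberately simplified: the schema is a plain dict of field names to Python types rather than a real Avro schema, the topic is a list standing in for Kafka, and `PLAY_START_SCHEMA`, `validate`, and `send` are hypothetical names.

```python
PLAY_START_SCHEMA = {
    "event_id": str,
    "event_type": str,
    "user_id": str,
    "title_id": str,
    "timestamp": str,
}


class SchemaError(ValueError):
    """Raised when an event does not match its registered schema."""


def validate(event, schema):
    """Return the event unchanged if it matches the schema, else raise SchemaError."""
    for field, expected_type in schema.items():
        if field not in event:
            raise SchemaError(f"missing required field: {field}")
        if not isinstance(event[field], expected_type):
            raise SchemaError(
                f"{field} should be {expected_type.__name__}, "
                f"got {type(event[field]).__name__}"
            )
    return event


def send(event, schema, topic):
    """Validate first, then append; a failed validation never touches the topic."""
    topic.append(validate(event, schema))


topic = []
good = {
    "event_id": "uuid-1234-abcd",
    "event_type": "play_start",
    "user_id": "u_98765432",
    "title_id": "t_StrangerThings_S4E1",
    "timestamp": "2026-05-04T18:32:11.452Z",
}
send(good, PLAY_START_SCHEMA, topic)

bad = dict(good, user_id=98765432)  # wrong type: int instead of str
try:
    send(bad, PLAY_START_SCHEMA, topic)
except SchemaError as err:
    rejected_reason = str(err)
```

The good event lands in the topic; the bad one raises before anything is written, so downstream consumers never see it.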

📖 How Netflix Uses Kafka for Distributed Streaming — Confluent

Kafka organises events into topics with partitions — parallel consumption by multiple downstream systems simultaneously. Source: Conduktor

Retention and replay:

Kafka doesn't store events forever. Netflix sets retention policies per topic: high-volume topics might retain data for hours, lower-volume ones for days. The safety net: all Kafka records are also persisted to Apache Iceberg tables on S3. If a downstream Flink job fails and needs to reprocess events that have already expired from Kafka, it reads from Iceberg instead. The pipeline is fully replayable.

Layer 3 - Apache Flink: Where Raw Events Become Useful Data

Kafka stores and delivers events reliably. But events in a queue don't power recommendations or dashboards. They need to be processed, and that's Apache Flink's job.

Flink jobs run continuously, 24/7, consuming from Kafka topics in near real time. A typical Flink job in Keystone runs this chain of operations:

Filter → Remove noise: system health pings, internal test events, bot traffic, malformed records that slipped past schema validation.

Enrich → A raw play_start event only contains user_id, title_id, and timestamp. Downstream systems need the show's genre, the user's country, the content rating. Flink enriches events by joining with side inputs: small reference datasets loaded into Flink task memory, so enrichment happens locally without any network calls.

Deduplicate → Devices retry failed requests. The same event can arrive in Kafka twice. Flink maintains a short time-window buffer in RocksDB (an embedded key-value store local to each Flink task), comparing event IDs and dropping duplicates before they reach storage.

Transform → Reshape the enriched event into the exact schema that each downstream storage system expects.

Window → Aggregate events across time. "Count all play_start events in the last 60 seconds, grouped by country and device type." This is how Netflix's real-time operations dashboards get live numbers updated every minute.
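
The chain above can be sketched in a few lines of plain Python. This is not Flink code and makes several simplifying assumptions: the dedup buffer is an unbounded set (a real one would evict old IDs), the side input is a hard-coded dict, timestamps are epoch seconds in a hypothetical `ts` field, and the transform step is reduced to choosing the output key.

```python
from collections import Counter

# Hypothetical reference data held in memory, standing in for a side input.
TITLE_GENRES = {"t_1": "sci-fi", "t_2": "comedy"}


def process(events, window_seconds=60):
    """Filter, deduplicate, enrich, and count events per window, region, and genre."""
    seen_ids = set()   # stands in for the RocksDB-backed dedup buffer
    counts = Counter()

    for event in events:
        # Filter: drop internal test traffic.
        if event.get("is_test"):
            continue
        # Deduplicate: skip event IDs that were already processed.
        if event["event_id"] in seen_ids:
            continue
        seen_ids.add(event["event_id"])
        # Enrich: local dictionary lookup, no network call.
        genre = TITLE_GENRES.get(event["title_id"], "unknown")
        # Window: bucket by event time into fixed-size windows.
        window_start = event["ts"] - event["ts"] % window_seconds
        # Transform: the output record is keyed by (window, region, genre).
        counts[(window_start, event["region"], genre)] += 1

    return counts


EVENTS = [
    {"event_id": "a", "title_id": "t_1", "region": "IN", "ts": 5},
    {"event_id": "a", "title_id": "t_1", "region": "IN", "ts": 5},   # device retry
    {"event_id": "b", "title_id": "t_2", "region": "IN", "ts": 42},
    {"event_id": "c", "title_id": "t_1", "region": "IN", "ts": 61},
    {"event_id": "d", "title_id": "t_1", "region": "US", "ts": 10, "is_test": True},
]

window_counts = process(EVENTS)
```

The retry and the test event disappear, and the three remaining events land in two windows: two in the window starting at 0 seconds and one in the window starting at 60.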

The 1:1 lesson Netflix learned the hard way:

Netflix initially tried one monolithic Flink job consuming all Kafka topics. It was a disaster. Different topics have wildly different volumes and burst patterns (play events spike on Friday evenings, error events spike during CDN outages), making it impossible to tune a single job for all of them without constant instability.

Their solution: one dedicated Flink job per Kafka topic. More jobs to operate, but each can be independently scaled, monitored, and tuned. A problem in the error_events Flink job doesn't affect the play_events Flink job. This is a real architectural lesson: operational simplicity at the individual job level outweighs the overhead of managing more jobs.

📖 Migrating Batch ETL to Stream Processing at Netflix — InfoQ

A Flink job pipeline: events enter from Kafka, flow through processing operators, and are written to storage sinks. Source: Apache Flink Docs

Layer 4 - Storage: Three Databases, Three Jobs

Processed events are routed to three different storage systems depending on how they'll be accessed:

Apache Cassandra - for millisecond reads at scale: Powers anything that needs to be fast, such as your Continue Watching row, personalised home screen, and real-time recommendation updates. Cassandra is a distributed NoSQL database with no single point of failure, designed for massive write throughput. Netflix's Cassandra deployment spans thousands of nodes across multiple clusters and scales linearly.

Apache Iceberg on S3 - for analytical queries: Long-term storage for ML model training, A/B test analysis, and content strategy decisions. Iceberg adds ACID transactions, time travel, and schema evolution on top of cheap object storage. The same data that flowed through Kafka and Flink in real time is also persisted here for batch processing. It's also the replay source when Kafka retention expires.

📖 Apache Iceberg — the open table format

Elasticsearch - for observability: Operational events (errors, latency spikes, quality degradations) are indexed here and power Netflix's internal engineering dashboards. When an on-call engineer needs to know "how many buffering events happened in the last 5 minutes in Southeast Asia," they're querying Elasticsearch.

Connecting the Tech to Real UX

Here's what all of this actually produces for a real Netflix user:

Your home screen is personalised in near real time. Every show you watch, every row you scroll past, every search you run — these events flow through Keystone within seconds and update your taste profile in Cassandra. The next time you open Netflix, the home screen reflects what you did in the last hour, not just your all-time history.

Thumbnails change based on what works for you personally. Netflix runs thousands of A/B thumbnail tests simultaneously. The event pipeline tracks which thumbnails led to a play and which were ignored and automatically serves the winning variant to users with similar taste profiles. All measured through events.

Video quality adjusts seamlessly before you notice. Quality-change events flow through Kafka and Flink in milliseconds. When Netflix detects your connection degrading, the pipeline routes a signal to the playback service before your buffer empties. You never see a spinner.

Content decisions are driven by event data. Which shows do people abandon after episode 1? Which genres drive subscription upgrades in specific markets? This runs as Spark batch jobs on Iceberg tables, billions of events informing which content Netflix commissions and licenses next.

Every row on your home screen — Top Picks, Continue Watching, Trending — is powered by events processed through Keystone in near real time. Source: Netflix

5 Lessons for Your Own Data Pipeline

Netflix's pipeline wasn't built in a day; it evolved through failures, rewrites, and hard-won production lessons over more than a decade. Here are five principles every data engineer can apply at any scale:

1. Capture first, process never at ingestion. Your event collection layer should do one thing: receive events and write them to a durable queue. No enrichment, no business logic, no database calls at the entry point. Anything you add there compounds into a bottleneck at scale. Keep ingestion stateless and fast.

2. Schema enforcement is your safety net, invest early. At any meaningful scale, a single bad deploy can silently corrupt your entire pipeline without schema validation. Invest in a Schema Registry before you need it. Avro or Protobuf with centralised validation means malformed events are rejected at the source, not discovered days later in broken downstream tables when the damage is already done.

3. One job per topic beats one monolith for all topics. If you're using Flink or Spark Streaming, resist the temptation to build one big job that handles everything. Separate topics have different volumes, burst patterns, and latency requirements. A dedicated job per topic means you can tune, scale, monitor, and fix each independently, and a failure in one doesn't cascade to others.

4. Match storage to access pattern, not convenience. Cassandra for millisecond point reads. Iceberg or Delta Lake for analytical queries over billions of rows. Elasticsearch for full-text and observability queries. These are not interchangeable. The most common mistake is picking one database for everything and then wondering why queries are slow. Design your storage tier around query patterns first.

5. Build for replay from day one. Pipelines fail. Jobs crash. Kafka topics expire. If you can't reprocess historical events, every failure is permanent data loss. Before you ship your first pipeline, answer: if this job needs to reprocess last week's data tomorrow, where does it read from? Netflix answers this with Iceberg as the replay source. You need your own answer before you go live.

The Numbers, In Context

Metric	Value
Daily events processed	2 trillion
Data ingested per day	3 petabytes
Data output per day	7 petabytes
Peak throughput	12.5 million events/second
Subscribers generating events	300M+ across 190 countries
Kafka topics	Hundreds, one per event type

Every number here represents a real engineering constraint that forced a specific architectural choice. The scale is impressive. The principles behind it are what actually matter.

Wrapping Up

The next time Netflix recommends something that feels uncomfortably accurate, or your video quality silently adjusts on a slow connection, or your Continue Watching row picks up exactly where you left off on a different device, that's 2 trillion events per day, flowing through Keystone, processed by Flink, stored in Cassandra and Iceberg, translating raw user actions into a product experience that feels effortless.

The pipeline is invisible. That's exactly the point.

For data engineers, the real takeaway isn't the scale. It's the principles. Capture fast. Enforce schemas. Separate concerns. Match storage to access patterns. Build for replay. These apply whether you're handling 2 trillion events or 2 thousand.

About the Author

I'm Aditya Singh Rathore, a Data Engineer passionate about building modern, scalable data platforms. I write about data engineering, system design, and real-world architectures on RecodeHive, breaking down complex systems into concepts anyone can learn from.

🔗 LinkedIn | GitHub

📩 Building a real-time pipeline? Drop your questions in the comments below.

How SSO Works - Case Study

rathoreadityasingh30@gmail.com (Aditya Singh Rathore) — Mon, 04 May 2026 00:00:00 GMT

You've done this a hundred times without thinking about it.

You land on a website, maybe LinkedIn, maybe Spotify, maybe some random productivity app, and instead of creating yet another account with yet another password, you just click "Sign in with Google."

Two seconds later, you're in.

No new password. No verification email. No "must contain one uppercase, one number, and the soul of a forgotten god." Just... in.

That's Single Sign-On (SSO) at work. And once you understand how it actually works under the hood, you'll see it everywhere.

The Master Key Analogy

Think of SSO like a master key for a hotel.

Every room in the hotel has its own lock - the gym, the pool, the restaurant, your room on the 7th floor. Normally, you'd need a separate key for each one. That would be exhausting.

Instead, the front desk gives you one key card when you check in. That single card opens every door you're allowed through, for the entire stay.

SSO works the same way. You prove who you are once. Everything else just opens.

Two Characters You Need to Know

Before we walk through the login flow, meet the two players involved:

1. Identity Provider (IdP) - This is the entity that knows who you are. Google, Microsoft, Apple - these are common Identity Providers. They hold your credentials and vouch for your identity.
2. Service Provider (SP) - This is the app or website you're actually trying to use. LinkedIn, GitHub, Notion, Slack - these are Service Providers. They don't store your password. They just trust the Identity Provider's word.

The whole dance of SSO happens between these two.

How It Actually Works: Step by Step

Let's walk through a real example - logging into LinkedIn using Google.

Step 1 - You knock on the door

You visit LinkedIn and click "Sign in with Google."

LinkedIn (the Service Provider) doesn't ask for your password. Instead, it says: "I don't know this person. Let me send them to Google."

Step 2 - LinkedIn redirects you to Google

LinkedIn sends you over to Google with an authentication request — essentially a note that says: "Hey Google, can you confirm who this person is?"

Step 3 - Google checks if you're already logged in

Google (the Identity Provider) looks for an active session on your browser.

If you're already logged into Google → it skips straight to step 6. No password needed.
If you're not logged in → it asks for your credentials.

Step 4 - You enter your Google credentials

You type in your Google email and password. This is the only place your credentials go. LinkedIn never sees them. Ever.

This is actually one of the biggest security wins of SSO — your password lives in one place, with one trusted provider, instead of being scattered across dozens of apps.

Step 5 - Google verifies who you are

Google checks your credentials against its own database. If everything matches, it doesn't just let you in — it creates something called an authentication token (think of it as a signed, digital stamp of approval).

Step 6 - Google sends that token back to LinkedIn

Google hands the token to LinkedIn. The token essentially says: "This person is who they say they are. I, Google, can confirm it."

LinkedIn trusts Google's word, reads the token, and lets you in — without ever having touched your password.

Step 7 - The magic of the existing session

Here's where SSO really earns its name.

Later that day, you open GitHub and click "Sign in with Google." GitHub sends the same authentication request to Google. But this time, Google already has an active session from when you logged into LinkedIn.

So instead of asking for your password again, Google just says: "Yep, I know this person. Here's their token."

You're in GitHub instantly. No password. No friction.

One login. Many doors.
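
The seven steps can be compressed into a toy model. The sketch below is a simplification under stated assumptions: the token is just a username plus an HMAC signature made with a secret that both sides share, whereas real providers typically sign with a private key and include an expiry, audience, and other claims. `IdentityProvider`, `service_provider_login`, and `SHARED_SECRET` are hypothetical names, and there are no real redirects or browsers involved.

```python
import hashlib
import hmac

SHARED_SECRET = b"demo-secret"  # assumption: stands in for the IdP's signing key


def _sign(username):
    return hmac.new(SHARED_SECRET, username.encode(), hashlib.sha256).hexdigest()


class IdentityProvider:
    def __init__(self, users):
        self._users = users      # username -> password (plain text only for the demo)
        self._sessions = {}      # browser_id -> username with an active session
        self.password_prompts = 0

    def authenticate(self, browser_id, username=None, password=None):
        """Return a signed token, asking for credentials only if no session exists."""
        if browser_id not in self._sessions:
            self.password_prompts += 1
            if username not in self._users or self._users[username] != password:
                raise PermissionError("bad credentials")
            self._sessions[browser_id] = username
        user = self._sessions[browser_id]
        return f"{user}.{_sign(user)}"


def service_provider_login(token):
    """Verify the IdP's signature and return the username; no password involved."""
    username, signature = token.rsplit(".", 1)
    if not hmac.compare_digest(signature, _sign(username)):
        raise PermissionError("token not signed by the identity provider")
    return username


idp = IdentityProvider({"ada@example.com": "correct horse"})

# First app: no session yet, so the IdP checks credentials.
linkedin_user = service_provider_login(
    idp.authenticate("browser-1", "ada@example.com", "correct horse")
)
# Second app, same browser: the existing session is reused, no password needed.
github_user = service_provider_login(idp.authenticate("browser-1"))
```

Both logins resolve to the same user, but the password was only requested once. Neither service provider function ever receives the password, only the signed token.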

The Protocols Behind the Scenes

SSO isn't magic - it runs on a set of agreed-upon rules that tell the Identity Provider and Service Provider how to talk to each other and how to trust each other. These rules are called protocols.

The three most common ones you'll hear about:

SAML (Security Assertion Markup Language) - the older, enterprise-friendly protocol. You'll find it in corporate SSO setups, think logging into your company's internal tools with your work email.

OpenID Connect - the modern, developer-friendly protocol built on top of OAuth. This is what powers most "Sign in with Google" buttons you see on consumer apps today.

OAuth - technically an authorization protocol (not authentication), but often used alongside OpenID Connect. It's what handles the "allow this app to access your Google account" permissions screen.

You don't need to memorize the differences right now. Just know that when SSO works smoothly, one of these protocols is doing the heavy lifting in the background.

Why Does Any of This Matter?

SSO isn't just a convenience feature. It solves real problems:

1. For users: Fewer passwords to remember means fewer weak passwords, fewer forgotten passwords, and fewer "reset my password" spirals at 11pm.
2. For security teams: When an employee leaves a company, revoking access to one Identity Provider cuts off access to every connected app instantly, instead of hunting down 30 individual accounts.
3. For developers: Building an app with SSO means you don't have to manage password storage, reset flows, or authentication security yourself. You offload all of that to a provider like Google or Microsoft that is very, very good at it.

The One Thing to Remember

If you take nothing else from this:

SSO means you prove your identity once, to one trusted provider, and that proof travels with you across every connected app.

Next time you click "Sign in with Google," you'll know exactly what's happening behind that button — a quiet handshake between two systems, so you don't have to think about it at all.

Enjoyed this? I write about data engineering, system design, and the concepts that actually matter in tech — without the jargon.

🔗 LinkedIn | GitHub

Delta Lake: An Introduction to Trustworthy Data Storage

rathoreadityasingh30@gmail.com (Aditya Singh Rathore) — Fri, 01 May 2026 00:00:00 GMT

There Is Something Wrong With Your Data Lake

Imagine this: your firm receives hundreds of records per hour, be it users signing up for an account, making purchases, or using your mobile application. You store all these records in a data lake, which is hosted on the cloud. Got it?

Now, imagine something happening to this system. Two pipelines write to the same table simultaneously, overwriting each other. And now half of your data is gone. No one notices until it becomes obvious in the weekly report.

The issue described above is a common one when using traditional data lakes. The thing is that data lakes were created to solve a different problem, one of storing information rather than ensuring its reliability. And that's what Delta Lake is designed to solve.

What is Delta Lake, in Plain English?

Consider a traditional data lake to be a folder in Google Drive, where anyone has the ability to edit or even delete anything inside without leaving an audit trail or version history. What if that folder was:

1. Version-controlled and could be rolled back to any previous state
2. Guaranteed to have a clean schema
3. Structured such that bad data can't possibly get stored
4. Secure against race conditions when used by multiple writers

This folder would be Delta Lake. It runs on top of the storage your organization already uses and delivers all of those guarantees without asking you to migrate off your existing infrastructure.

The Four Unique Features of Delta Lake

1. ACID Transactions: Corruption-Free Data!

ACID stands for Atomicity, Consistency, Isolation, and Durability. You don't need to memorize the terms, but you do need to understand what they buy you. Delta Lake guarantees that when two processes try to modify the same dataset at the same time, neither silently overwrites the other's changes. Each write either commits in full or has to take its turn after the other, much like customers queuing at a cashier, so the table always stays in a consistent state.
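One way to picture how two writers avoid overwriting each other: every commit is a new numbered file in a log, and creating that file is an all-or-nothing step. The sketch below is a simplified stand-in that uses the local filesystem, not Delta Lake's actual implementation; `try_commit` and `commit_with_retry` are hypothetical names chosen for illustration.

```python
import json
import os
import tempfile


def try_commit(log_dir, version, actions):
    """Try to claim `version` by creating its log file; fail if it already exists."""
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        # Mode "x" creates the file only if it does not exist yet,
        # so two writers can never both claim the same version.
        with open(path, "x") as f:
            json.dump(actions, f)
        return True
    except FileExistsError:
        return False


def commit_with_retry(log_dir, actions, max_attempts=5):
    """Commit at the next free version, retrying if another writer got there first."""
    for _ in range(max_attempts):
        version = len([n for n in os.listdir(log_dir) if n.endswith(".json")])
        if try_commit(log_dir, version, actions):
            return version
    raise RuntimeError("could not commit after several attempts")


log_dir = tempfile.mkdtemp()
first = commit_with_retry(log_dir, {"op": "writer A adds 10 rows"})
second = commit_with_retry(log_dir, {"op": "writer B adds 5 rows"})
# A writer that still believes version 0 is free loses the race cleanly.
stale = try_commit(log_dir, 0, {"op": "late writer"})
print(first, second, stale)
```

A real system also has to check whether the other writer's changes actually conflict with its own before retrying; that step is left out here to keep the sketch short.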

2. Time Travel: The "Undo" Feature

Every operation on a Delta table creates a new version of that table. Accidentally deleted a record? Ran a bad update? With time travel, you can query the table as it was at an earlier version or timestamp, and revert to that state if you need to.

3. Schema Enforcement: Bad Data Rejection

Suppose your schema requires a field to hold only numeric values, and a client tries to send a record with a string in that field. Delta Lake rejects the write instead of letting the mismatched data into the table.

4. Schema Evolution – Evolving without Breaking Anything

As your product matures, so does your data. Want to add an extra column? Delta Lake makes schema evolution easy – your data remains untouched while your workflows continue uninterrupted.
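To make enforcement and evolution concrete, here is a minimal plain-Python sketch of the two rules. It is a toy model, not Delta Lake code: `validate_write` and its `merge_schema` flag are hypothetical names, and the `city` column is made-up sample data. (In Delta Lake itself, the opt-in for evolution on write is the `mergeSchema` option.)

```python
def validate_write(table_schema, rows, merge_schema=False):
    """Return the (possibly widened) schema, or raise if the rows do not fit."""
    schema = dict(table_schema)
    for row in rows:
        for column, value in row.items():
            if column not in schema:
                if not merge_schema:
                    raise ValueError(f"unexpected column: {column}")
                # Evolution: accept the new column and record its type.
                schema[column] = type(value)
            elif not isinstance(value, schema[column]):
                raise TypeError(
                    f"column {column!r} expects {schema[column].__name__}, "
                    f"got {type(value).__name__}"
                )
    return schema


schema = {"id": int, "name": str, "salary": int}

# Enforcement: a string in a numeric column rejects the whole batch.
try:
    validate_write(schema, [{"id": 5, "name": "Amara Osei", "salary": "a lot"}])
    rejected = False
except TypeError:
    rejected = True

# Evolution: a new column is accepted only when explicitly allowed.
evolved = validate_write(
    schema,
    [{"id": 5, "name": "Amara Osei", "salary": 69000, "city": "Accra"}],
    merge_schema=True,
)
print(rejected, sorted(evolved))
```

The point of making evolution opt-in is that a typo in a column name fails loudly by default instead of quietly adding a new column.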

And How Exactly Does That Work?

All the magic above happens because of a mechanism known as the Transaction Log, and it’s kept in a folder named _delta_log within your table itself. Every individual action, be it inserting, deleting, or updating records, is logged in a JSON format within that log. Delta Lake relies on this transaction log to keep track of the latest status of your table, and which older files can be safely deleted from the system.

Here’s how your table appears on the disk:

my_table/
├── _delta_log/
│   ├── 00000000000000000000.json   ← "Table was created"
│   ├── 00000000000000000001.json   ← "10 rows were added"
│   └── 00000000000000000002.json   ← "Salary column was updated"
├── part-00001.parquet
├── part-00002.parquet
└── part-00003.parquet

The real data is stored in Parquet files, which are highly efficient to query. The transaction log is the brain, and the Parquet files are the data store.
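The idea that the log decides which Parquet files count as "the table" can be sketched in a few lines of plain Python. This is a deliberately simplified model: real Delta log entries carry much more metadata, and the mapping of the three commits to specific files below is an assumption made for illustration. `write_commit` and `live_files` are hypothetical helper names.

```python
import json
import os
import tempfile


def write_commit(log_dir, version, actions):
    """Write one commit as a zero-padded JSON file, one action per line."""
    path = os.path.join(log_dir, f"{version:020d}.json")
    with open(path, "w") as f:
        for action in actions:
            f.write(json.dumps(action) + "\n")


def live_files(log_dir, as_of=None):
    """Replay commits in order (optionally up to `as_of`) to find the active data files."""
    files = set()
    for name in sorted(os.listdir(log_dir)):
        version = int(name.split(".")[0])
        if as_of is not None and version > as_of:
            break
        with open(os.path.join(log_dir, name)) as f:
            for line in f:
                action = json.loads(line)
                if "add" in action:
                    files.add(action["add"])
                elif "remove" in action:
                    files.discard(action["remove"])
    return files


log_dir = tempfile.mkdtemp()
write_commit(log_dir, 0, [{"add": "part-00001.parquet"}])
write_commit(log_dir, 1, [{"add": "part-00002.parquet"}])
# An update rewrites a file: the old one is removed, the new one added.
write_commit(log_dir, 2, [{"remove": "part-00001.parquet"},
                          {"add": "part-00003.parquet"}])

print(sorted(live_files(log_dir)))           # current state of the table
print(sorted(live_files(log_dir, as_of=0)))  # the table as of version 0
```

Replaying only part of the log is also the intuition behind time travel: the old file is still on disk, and an older version of the log still points at it.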

Let's Write Some Code

Setting Up

# Run in a shell first: pip install delta-spark pyspark
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

builder = SparkSession.builder \
    .appName("MyFirstDeltaTable") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")

spark = configure_spark_with_delta_pip(builder).getOrCreate()

Creating a Delta Table

# Let's create a simple employee dataset
employees = [
    (1, "Priya Sharma", "Engineering", 82000),
    (2, "Liam O'Brien", "Marketing", 67000),
    (3, "Yuki Tanaka", "Engineering", 91000),
    (4, "Carlos Mendez", "Sales", 74000),
]
columns = ["id", "name", "department", "salary"]

df = spark.createDataFrame(employees, columns)

# Save it as a Delta table
df.write.format("delta").mode("overwrite").save("/data/employees")

That's it. You now have a Delta table with a transaction log, version history, and all the reliability features built in automatically.

Reading It Back

df = spark.read.format("delta").load("/data/employees")
df.show()

+---+-------------+------------+------+
| id|         name|  department|salary|
+---+-------------+------------+------+
|  1| Priya Sharma| Engineering| 82000|
|  2| Liam O'Brien|   Marketing| 67000|
|  3|  Yuki Tanaka| Engineering| 91000|
|  4|Carlos Mendez|       Sales| 74000|
+---+-------------+------------+------+

Using Time Travel

Let's say you update some salaries, then realize the update was wrong:

from delta.tables import DeltaTable

delta_table = DeltaTable.forPath(spark, "/data/employees")

# Give everyone in Engineering a raise
delta_table.update(
    condition="department = 'Engineering'",
    set={"salary": "salary + 5000"}
)

Oops, it turns out that update was wrong. No need to panic. Just travel back to version 0:

# Check the history first
delta_table.history().show()

# Read the original data before the update
original_df = spark.read \
    .format("delta") \
    .option("versionAsOf", 0) \
    .load("/data/employees")

original_df.show()

You get your original data back, untouched. You can restore it, compare it, or just use it to figure out what went wrong.

Inserting and Updating at the Same Time (MERGE)

One of the most useful everyday operations is MERGE, often called an upsert. It means: update the record if it exists, insert it if it doesn't.

# Some incoming data -- one update, one brand new employee
incoming = [
    (2, "Liam O'Brien", "Marketing", 71000),  # salary updated
    (5, "Amara Osei", "HR", 69000),            # new employee
]

incoming_df = spark.createDataFrame(incoming, columns)

delta_table.alias("existing").merge(
    incoming_df.alias("new"),
    "existing.id = new.id"
).whenMatchedUpdate(set={
    "salary": "new.salary"
}).whenNotMatchedInsert(values={
    "id":         "new.id",
    "name":       "new.name",
    "department": "new.department",
    "salary":     "new.salary"
}).execute()

One operation. No duplicates. No manual checking. Clean results every time.
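If the matched/not-matched logic feels abstract, the same semantics can be written out in plain Python. This is only a sketch of what an upsert means, using dictionaries in place of a Delta table; the `upsert` function and its `update_cols` parameter are hypothetical.

```python
def upsert(existing, incoming, key="id", update_cols=("salary",)):
    """Return a new list: matched rows get selected columns updated, unmatched rows are inserted."""
    merged = {row[key]: dict(row) for row in existing}
    for row in incoming:
        if row[key] in merged:
            # Matched: update only the chosen columns.
            for col in update_cols:
                merged[row[key]][col] = row[col]
        else:
            # Not matched: insert the full row.
            merged[row[key]] = dict(row)
    return list(merged.values())


existing = [
    {"id": 2, "name": "Liam O'Brien", "department": "Marketing", "salary": 67000},
    {"id": 4, "name": "Carlos Mendez", "department": "Sales", "salary": 74000},
]
incoming = [
    {"id": 2, "name": "Liam O'Brien", "department": "Marketing", "salary": 71000},
    {"id": 5, "name": "Amara Osei", "department": "HR", "salary": 69000},
]

result = upsert(existing, incoming)
print(len(result), [r["salary"] for r in result])
```

Because rows are keyed by `id`, running the same incoming batch twice leaves the result unchanged, which is what makes upserts safe to retry.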

Keeping Your Table Healthy

Over time, Delta Lake accumulates old data files for time travel. You'll want to periodically clean those up:

# Remove files older than 7 days
spark.sql("VACUUM delta.`/data/employees` RETAIN 168 HOURS")

And if your table gets many small files over time (which slows down queries), compact them:
# Compact small files into larger, more efficient ones
spark.sql("OPTIMIZE delta.`/data/employees`")

Think of VACUUM as taking out the trash and OPTIMIZE as reorganizing your desk. Both are good habits to run on a schedule.
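The rule VACUUM applies can be sketched as a simple filter: a file is a cleanup candidate only if the current table no longer references it and it is older than the retention window. The code below is a plain-Python illustration of that rule, not how Delta Lake implements it; `vacuum_candidates` and the file timestamps are made up for the example.

```python
from datetime import datetime, timedelta


def vacuum_candidates(files_on_disk, live, now, retain_hours=168):
    """Files safe to delete: unreferenced by the current table AND older than the cutoff."""
    cutoff = now - timedelta(hours=retain_hours)
    return sorted(
        name for name, modified in files_on_disk.items()
        if name not in live and modified < cutoff
    )


now = datetime(2026, 5, 1, 12, 0)
files_on_disk = {
    "part-00001.parquet": now - timedelta(days=10),  # replaced long ago
    "part-00002.parquet": now - timedelta(days=10),  # still in use
    "part-00003.parquet": now - timedelta(days=2),   # replaced recently
}
live = {"part-00002.parquet"}

print(vacuum_candidates(files_on_disk, live, now))
```

The trade-off is worth keeping in mind: once old files are removed, time travel to versions that depended on them is no longer possible, so the retention window is effectively your undo window.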

When Should You Utilize Delta Lake?

Delta Lake is perfect for use when:

1. There are several pipelines or multiple parties writing to the same data set.
1. An audit history of all changes is necessary.
1. The schema of your data can change.
1. You would like to detect any data that could cause problems.
1. Real-time streams and batch historical data are being combined.

If your files are static and will never change, plain Parquet is enough. But the moment your data starts changing, Delta Lake is worth its weight in gold.

Conclusion

In essence, Delta Lake takes the idea of a data lake (low-cost, scalable, flexible storage) and makes it reliable. ACID transactions eliminate silent corruption, time travel lets you recover from mistakes, schema enforcement keeps bad data out, and schema evolution lets your tables grow with your product.

At the heart of the system is nothing more than a transaction log: a simple, audit-ready record of every change made to your data.

If you are building data pipelines where data quality really matters, and sooner or later it does, Delta Lake is a strong candidate for the base of your stack. Best of all, it is easy to adopt.

How I cleared DP-700 Certification Exam

rathoreadityasingh30@gmail.com (Aditya Singh Rathore) — Fri, 01 May 2026 00:00:00 GMT

If you're a data engineer working in the Microsoft ecosystem, Microsoft Fabric is impossible to ignore, and the DP-700 certification is one of the best ways to prove you understand it. I recently cleared the Microsoft DP-700: Fabric Data Engineer Associate exam, and this is an honest breakdown of how I did it, what actually helped, and what you should skip.

What Is Microsoft Fabric, Really?

Before diving into the prep strategy, let's quickly address what makes Fabric different.

Microsoft Fabric is not just another Azure tool. It's Microsoft's attempt to merge your entire modern data stack into a single platform — data engineering, data science, data warehousing, real-time analytics, and Power BI, all under one roof.

Think of it this way: earlier, you had Azure Data Factory for orchestration, Synapse for warehousing, and Power BI for reporting — three separate tools with separate setups and billing. Fabric brings all of that together in one unified experience.

This shift in architecture is exactly why the DP-700 exam feels different from other Azure certifications. It's not about memorizing service names — it's about understanding how these pieces fit together in real-world data solutions.

About the DP-700 Exam

Detail	Info
Full Name	Microsoft Fabric Data Engineer Associate
Level	Associate
Format	MCQs + Case Studies
Difficulty	Medium (concept-heavy, not definition-heavy)
Focus	Real-world architecture and decision-making

One important reality check: this is not a memorization exam. If you go in trying to rote-learn definitions, the scenario-based questions will catch you off guard. The exam tests whether you can make the right architectural decision — not whether you can recite what a Lakehouse is.

My Preparation Strategy

1. Microsoft Learn — Your Non-Negotiable Starting Point

Start here, period. The Microsoft Learn paths for DP-700 are well-structured and align closely with the actual exam topics. They cover all the core concepts across Fabric's components.

That said, Microsoft Learn alone is not enough. Think of it as building your foundation — you still need to put that foundation to work.

2. Hands-On Practice — The Actual Game Changer

This is where most candidates underinvest, and it shows on exam day.

I spent dedicated time:

Creating and exploring Lakehouses
Building and running Data Pipelines
Working with Dataflows Gen2
Exploring the Fabric UI thoroughly (this matters more than you think)

Microsoft Fabric has a free trial. Use it. The exam includes scenario questions where you need to navigate or reason about the interface. If you've never seen it, you'll struggle to answer those questions confidently.

3. Practice Tests — Learn to Eliminate, Not Just Recall

Practice tests serve two purposes. First, they show you where your weak areas are. Second, and more importantly, they teach you how to approach tricky answer options.

Many DP-700 questions have two options that look almost identical. The skill you're actually being tested on is eliminating the wrong answer, not picking the right one from memory. Practice tests train that skill.

4. YouTube for Concept Clarity

Whenever a concept didn't fully click after reading, I turned to YouTube. Sometimes a 10-minute video does what 2 hours of documentation can't. Particularly useful for visual concepts like DirectLake mode, Delta Table versioning, and pipeline orchestration flows.

Key Concepts You Must Know

These are the areas that carry the most weight in the exam. If any of these feel unclear, go back and invest time here before moving forward.

Lakehouse

The Lakehouse is the central concept in Microsoft Fabric. It combines the flexibility of a Data Lake with the structure of a Data Warehouse. If this concept isn't solid, everything built on top of it will feel unstable.

Data Pipelines vs. Dataflows Gen2

A common trap in the exam is knowing when to use each:

Pipelines → Orchestration (similar to Azure Data Factory). Use for scheduling, triggering, and controlling the flow of data.
Dataflows Gen2 → Transformation. Use for cleaning, shaping, and preparing data using a Power Query-like interface.

The exam loves to test this distinction with scenario questions.

Delta Tables

Delta Tables are the backbone of storage in Fabric. Key areas to understand:

ACID transaction support
Time travel and versioning
How Delta integrates with the Lakehouse

Power BI and DirectLake Mode

DirectLake is one of Fabric's most important innovations — it allows Power BI to query data directly from the Lakehouse without importing it, while still delivering near-import performance. This appears in multiple exam scenarios.

Workspace and Security Model

Understand roles, permissions, and how access is managed across Fabric items. Security-related questions appear more than people expect.

My Study Timeline

This is what actually happened — not an ideal plan, but an honest one:

Week 1 — Went through Microsoft Learn modules and explored the Fabric UI (a lot of clicking around to understand the platform)
Week 2 — Hands-on practice: built pipelines, created Lakehouses, ran Dataflows, explored Delta Tables
Week 3 — Practice tests, identified weak areas, revised those topics, and did a final pass on key concepts

Some days I studied 3–4 focused hours. Some days were slower. Consistency over intensity is what got me through.

Exam Day — What It Actually Felt Like

Here's a realistic walkthrough of the experience:

First few questions: Straightforward — concepts you've covered
Middle section: Scenario-based questions where two options look very similar. This is where hands-on familiarity pays off.
Case studies: Time-consuming but manageable if you understand architecture well
End section: A few questions that feel unexpected — stay calm, apply what you know

Key observations from exam day:

Time management matters. Don't spend 10 minutes on one question.
Read each question fully before looking at options.
Scenario questions reward understanding, not recall.

What to Do (and What to Avoid)

Do this:

Practice hands-on inside Fabric (free trial is available)
Understand the why behind architectural choices, not just what each component does
Learn from practice test mistakes — review every wrong answer
Revise your weak areas before the exam, not your strong areas

Avoid this:

Trying to memorize definitions — the exam will test application, not recall
Skipping the UI experience — you need to recognize Fabric's interface
Ignoring practice tests — they're the closest thing to the real exam experience

Is DP-700 Worth It?

Yes, if:

You're a data engineer or data professional working with Microsoft technologies
You're building or designing modern data platforms
You want to position yourself for roles that involve Microsoft Fabric, Synapse, or Power BI

Not essential if:

You have no plans to work in the Microsoft data ecosystem
You're focused on non-data engineering roles

Final Thoughts

Microsoft Fabric is still maturing, but its direction is clear — Microsoft is consolidating the modern data stack into a single platform, and it's gaining adoption fast. Understanding Fabric deeply, not just passing an exam on it, is genuinely useful right now.

The DP-700 is a solid way to validate that understanding. Approach it with real hands-on practice and a focus on concepts over definitions, and you'll be in a good position on exam day.

Useful Resources

Have questions about DP-700 prep or Microsoft Fabric? Drop a comment below; happy to help.

Connect on LinkedIn | GitHub

Lakehouse vs Data Warehouse: What's the Difference and When to Use Each

rathoreadityasingh30@gmail.com (Aditya Singh Rathore) — Fri, 01 May 2026 00:00:00 GMT

I made a mistake in my second month as a data engineer.

Our startup was growing fast: three data sources had become twelve almost overnight. Product events from Mixpanel, orders from Shopify, support tickets from Zendesk, raw logs from our backend. I needed everything in one place, queryable, fast.

So I did what made sense at the time: I dumped everything into our Snowflake warehouse. Raw JSON blobs, unnested arrays, half-cleaned API responses — all of it, straight in.

Three weeks later, our BI team couldn't trust a single number. Our schema was a mess. Re-ingesting data cost us real money. And every new data source I added made things worse, not better.

That mess is what taught me the real difference between a Lakehouse and a Data Warehouse and, more importantly, why you almost always need both.

What Is a Data Warehouse?

After my Snowflake disaster, a senior engineer on the team pulled me aside and said something I didn't fully appreciate at the time:

"A warehouse is not a dumping ground. It's a showroom."

He was right. The Data Warehouse has been the backbone of business intelligence for decades precisely because it enforces discipline. Data must be cleaned and structured before it enters. No exceptions.

This is called schema-on-write: the shape of your data is defined upfront, and anything that doesn't fit gets rejected. That strictness feels like a constraint until you're the analyst trying to build a board-level revenue report and you actually need to trust the numbers.

Key characteristics:

1. Designed for structured, cleaned, analytics-ready data
1. Strict schema enforcement (schema-on-write)
1. Highly optimized for SQL-based analytical queries
1. Strong governance, security, and access controls
1. Primary consumers are SQL analysts, BI teams, and business stakeholders

Platforms like Snowflake, Google BigQuery, Amazon Redshift, and Azure Synapse are well-known implementations. They excel when your data is already clean and your consumers need fast, reliable SQL access.

My mistake wasn't using Snowflake. It was using it for the wrong stage of the pipeline.

What Is a Lakehouse?

After the Snowflake incident, I started reading about data lakes. The pitch was appealing: store everything cheaply in raw form, figure out structure later.

So I tried that next. We set up an Azure Data Lake, dumped our raw files in (CSVs, JSONs, Parquet, logs), and called it a win.

Except six months later, nobody could find anything. Data existed, but nobody trusted it. There was no validation, no versioning, no way to know if what you were querying was the right version of a file. We had built what the industry lovingly calls a data swamp.

The Lakehouse pattern emerged to solve exactly this problem. It takes the cost efficiency and flexibility of object storage, and adds a proper table layer on top using open formats like Delta Lake, Apache Iceberg, or Apache Hudi. You get ACID transactions, schema enforcement, time travel, and SQL access without abandoning the flexibility of raw storage.

Key characteristics:

1. Stores raw, semi-structured, and structured data in a single system
1. Uses open table formats (Delta Lake, Iceberg, Hudi)
1. Supports multiple processing engines like Spark, Python, and SQL
1. Schema can evolve over time as data needs change
1. Supports both engineering pipelines and ML workflows from the same storage layer

Platforms like Databricks and modern cloud-native setups implement this pattern well. It's particularly powerful when your team spans both data engineering and data science — both can work from the same storage layer without stepping on each other.

Key Differences at a Glance

Aspect	Lakehouse	Data Warehouse
Data Type	Raw, semi-structured, and structured	Structured only
Schema Approach	Schema-on-read or evolving	Schema-on-write, strict
Flexibility	High	Moderate
Processing Engines	Spark, Python, SQL	Primarily SQL
Primary Users	Data Engineers, Data Scientists	Analysts, BI teams
Primary Use Cases	Ingestion, transformation, ML	Reporting, dashboards, ad-hoc analytics
Governance Maturity	Developing	Mature, well-established
Storage Cost	Lower (object storage)	Higher (optimized proprietary storage)
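The "Schema Approach" row is the one that trips people up most, so here is a small plain-Python contrast. It is a sketch under simplifying assumptions: `ORDER_SCHEMA`, `write_strict`, and `read_with_schema` are hypothetical names, and a Python list stands in for both storage systems.

```python
import json

ORDER_SCHEMA = {"order_id": int, "amount": float}


def write_strict(table, record):
    """Schema-on-write: validate before storing; reject anything that does not fit."""
    if set(record) != set(ORDER_SCHEMA):
        raise ValueError("record does not match the table schema")
    for column, expected in ORDER_SCHEMA.items():
        if not isinstance(record[column], expected):
            raise ValueError(f"{column} must be {expected.__name__}")
    table.append(record)


def read_with_schema(raw_lines):
    """Schema-on-read: everything was stored as-is; shape is applied at query time."""
    rows = []
    for line in raw_lines:
        data = json.loads(line)
        try:
            rows.append({"order_id": int(data["order_id"]),
                         "amount": float(data["amount"])})
        except (KeyError, TypeError, ValueError):
            continue  # unparseable records are skipped by this reader
    return rows


warehouse = []
write_strict(warehouse, {"order_id": 1, "amount": 19.99})
try:
    write_strict(warehouse, {"order_id": "two", "amount": 5.0})
except ValueError:
    pass  # the bad record never reaches the table

lake = ['{"order_id": "1", "amount": "19.99"}', '{"order_id": "two"}']
print(len(warehouse), read_with_schema(lake))
```

Notice where the bad record ends up in each case: the strict table never stores it, while the raw store keeps it and leaves every reader to decide what to do with it.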

When to Use a Lakehouse

Think of the Lakehouse as the engineering zone.

In our case, this is where raw Shopify orders land at 2am, where Mixpanel event logs pile up, where our ML team runs experiments on customer behavior data. It's messy in the best possible way: flexible, cheap, and tolerant of the chaos that comes with early-stage data.

Use a Lakehouse when:

You are ingesting raw or semi-structured data from APIs, event streams, IoT devices, or application logs
You need to run transformation and cleaning pipelines before data is analytics-ready
Your team works primarily in Spark or Python
Your schema changes frequently as business or source systems evolve
You are building ML features, training datasets, or experimental models
You need cost-efficient storage for large volumes of data at various stages of processing

If I had started here instead of going straight to Snowflake, I would have saved myself three weeks of firefighting.

When to Use a Data Warehouse

Think of the Data Warehouse as the consumption zone.

Once our data was cleaned and validated in the Lakehouse, we loaded curated datasets into Snowflake, and that is when it finally worked the way it was supposed to. Our BI team connected Power BI to it, the finance team ran their monthly reports, and the numbers matched.

Use a Data Warehouse when:

Data has already been transformed and is ready for consumption
Your consumers are SQL analysts or BI teams using tools like Tableau, Looker, or Power BI
You need fast, predictable query performance on large structured datasets
Governance, row-level security, and access controls are critical requirements
You are supporting stable, recurring reports that business decisions depend on

The warehouse isn't where data is processed. It's where processed data is served.

How They Work Together

Here's what nobody tells you early enough: you almost always need both.

Lakehouse and Data Warehouse are not competing choices. They serve different stages of the same data lifecycle. Once we restructured our setup, the flow looked like this:

Raw data lands in the Lakehouse: Shopify orders, Mixpanel events, Zendesk tickets, all of it
Our data engineers transform and clean it using Spark and dbt
Curated, structured datasets are loaded into Snowflake
Power BI and Tableau connect to Snowflake for dashboards and business reporting
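The flow above can be sketched in miniature with plain Python. This is a toy stand-in for the real Spark and dbt jobs, with made-up order events; `curate` is a hypothetical function that plays the role of the transform step between the raw zone and the warehouse.

```python
import json


def curate(raw_events):
    """Parse raw JSON lines, drop malformed or incomplete ones, and deduplicate by order id."""
    curated = {}
    for line in raw_events:
        try:
            event = json.loads(line)
            order_id = event["order_id"]
            amount = round(float(event["amount"]), 2)
        except (json.JSONDecodeError, KeyError, TypeError, ValueError):
            continue  # stays in the raw zone, never reaches the warehouse
        curated[order_id] = {"order_id": order_id, "amount": amount}
    return list(curated.values())


raw_zone = [
    '{"order_id": "A1", "amount": "19.990"}',
    '{"order_id": "A1", "amount": "19.990"}',   # duplicate delivery
    '{"order_id": "A2"}',                        # missing amount
    'not json at all',
    '{"order_id": "A3", "amount": 5}',
]

warehouse_rows = curate(raw_zone)
print(warehouse_rows)
```

The raw zone keeps all five lines, including the broken ones, so nothing is lost if the cleaning rules change later; only the two clean rows are loaded for reporting.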

The Lakehouse handled the complexity of early-stage data. The Warehouse handled the reliability of what our stakeholders actually saw. Each did what it was best at.

The moment we stopped treating them as alternatives and started treating them as sequential layers, everything clicked.

Choosing Between Them

If you're still unsure, here's the simplest filter I've found: ask who is consuming this data, and in what state.

If the consumer is a data engineer or data scientist working with raw or intermediate data → Lakehouse
If the consumer is an analyst or business user needing clean, structured data for reporting → Data Warehouse
If you have both types of consumers (and most teams do after a few months of growth) → use both, in sequence

The workload determines the architecture. Not preference, not trend, not what a vendor happens to be marketing this quarter.

Conclusion

I wasted a month learning this the hard way. You don't have to.

The Lakehouse gives you flexibility, scale, and support for diverse workloads across engineering and data science. The Data Warehouse gives you structure, query performance, and the governance that business reporting demands.

They're not rivals. They're teammates. And the best data platforms I've seen since don't choose between them — they use each exactly where it belongs, and build the pipeline that connects them.

If you're in the early stages of designing your data platform and figuring out where each piece fits, I'd love to compare notes.


Microsoft Fabric: One Platform, One Lake, Every Data Workload

rathoreadityasingh30@gmail.com (Aditya Singh Rathore) — Fri, 01 May 2026 00:00:00 GMT

Modern data teams don't struggle because of a lack of tools - they struggle because of too many.

A typical data stack today might include a cloud data warehouse, an object store, a managed Spark environment, a pipeline orchestration tool, and a BI layer on top. Each is powerful on its own. But getting them to work together (moving data across systems, keeping governance consistent, debugging failures across layers) often becomes a bigger challenge than the actual data work itself.

I ran into this exact problem while building pipelines across Azure Data Factory, ADLS Gen2, and Synapse. Every hand-off between tools meant another connection to configure, another permission to grant, another place for something to silently break.

Microsoft Fabric takes a different approach: instead of adding another tool to the stack, it brings everything together into a single unified platform. Here's how it actually works.

The Foundation: OneLake

Every component in Fabric is built on top of OneLake, the platform's unified, logical data lake and the single source of truth for your entire Fabric workspace.

Every workload, whether it's a Spark notebook, a SQL warehouse query, a Power BI report, or an ML experiment, reads from and writes to the same underlying storage. No data movement between services. No export-and-reload step when a data scientist needs access to a table a data engineer just built.

OneLake stores everything in Delta Parquet format, an open-source table format that supports ACID transactions, schema enforcement, time travel, and versioning. This matters: your data is not locked into a proprietary format. It's readable by Spark, DuckDB, Pandas, Polars, and most modern query engines outside of Fabric too.

📖 Read more: What is OneLake?

The first time I opened OneLake in my Fabric workspace, what struck me was how everything just appeared: my Lakehouse tables, my warehouse tables, all visible in one file explorer without any registration or sync step. That's when the "one lake" concept clicked for me practically, not just conceptually.

📸 Screenshot: OneLake file explorer from my Fabric workspace — Lakehouse and Warehouse tables visible side by side

Data Engineering: Lakehouses, Spark, and Notebooks

Fabric's data engineering experience is organized around the Lakehouse — a storage construct that combines the flexibility of a data lake with the query capabilities of a data warehouse.

When you create a Lakehouse, you get a two-zone structure:

A Files area for raw, unstructured, or semi-structured data (CSV, JSON, images, logs)
A Tables area where data is stored as managed Delta tables, immediately queryable by SQL, Spark, and Power BI

For transformation workloads, Fabric provides a fully managed Apache Spark environment. You write notebooks in Python, Scala, SQL, or R. Clusters are serverless by default — they start on demand, require no configuration, and shut down automatically when idle.

📖 Read more: Apache Spark in Microsoft Fabric

📸 Screenshot: A Spark notebook from my Fabric workspace — reading raw CSV from the Files zone, writing a clean Delta table to Tables

Coming from standalone Databricks, the Spark notebook experience in Fabric felt noticeably lighter to set up. No cluster configuration, no runtime version juggling: you open a notebook and it just works.

For production workloads, you can promote notebooks to Spark Job Definitions for scheduled execution, and manage library dependencies using Environments: versioned, shareable Spark configurations that eliminate the classic "works on my cluster" problem.

📖 Read more: Fabric Lakehouse overview

Data Ingestion and Orchestration: Data Factory

Getting data from external systems into the Lakehouse is the job of Data Factory, Fabric's data integration and orchestration layer.

Data Factory offers two primary patterns:

Pipelines - The activity-based orchestration tool, familiar to anyone who has used Azure Data Factory or Apache Airflow. You build directed acyclic graphs of copy activities, transformation steps, conditional logic, and triggers. Fabric pipelines support hundreds of connectors to external databases, REST APIs, cloud storage, and SaaS applications.

Dataflows Gen2 - A code-free alternative using a visual, Power Query-based interface. Transformations compile to Spark or SQL execution under the hood, a practical option for analysts who need to express transformation logic without writing code.

📸 Screenshot: A pipeline from my Fabric workspace ingesting from a REST API into the Lakehouse — configured entirely within Fabric, no external ADF instance needed

One thing I genuinely appreciated: neither pipelines nor dataflows require a separate connection configuration to reach your Lakehouse because it's already in the same workspace. You select it from a dropdown. Small thing, big time saver when you're building pipelines daily.

SQL Analytics: The Data Warehouse

Fabric's Data Warehouse is a fully managed T-SQL analytics engine, but with an important architectural distinction. It stores its data in Delta Parquet on OneLake, not in a proprietary internal format.

This means tables written by your Spark notebooks in the Lakehouse are directly readable by warehouse SQL queries, and warehouse tables are readable by Spark, without any copy or ETL step in between.

A practical decision guide:

Use the Lakehouse when...	Use the Warehouse when...
Workloads are Spark-heavy	Consumers are SQL analysts
Data is schema-flexible	Structured, governed tables are needed
Programmatic transformation logic is required	Strong query performance with SQL semantics is the priority

📸 Screenshot: Querying a Lakehouse Delta table directly from the Fabric Warehouse SQL editor — no data copy needed

Real-Time Intelligence: Streaming and Event Data

Real-Time Intelligence is Fabric's answer to streaming workloads, and one of the more complete streaming experiences available within a unified platform.

Eventstreams act as a managed event streaming layer. You connect to sources like Azure Event Hubs, Kafka, or IoT Hub, apply in-flight transformations using a visual stream-processing editor, and route output to multiple destinations simultaneously.

The destination for high-frequency event data is typically an Eventhouse, which contains one or more KQL databases. KQL (Kusto Query Language) is optimized for time-series and log data, and it is significantly faster than SQL for streaming analytics queries like "show me anomalies in sensor readings in the last 15 minutes, grouped by device."
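To show the shape of that example question, here is the same computation in plain Python. This is not KQL and not how an Eventhouse executes it; the `recent_anomalies` function, the fixed threshold used to define an "anomaly," and the sample readings are all assumptions for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def recent_anomalies(readings, now, window_minutes=15, threshold=100.0):
    """Count readings above `threshold` per device within the last `window_minutes`."""
    cutoff = now - timedelta(minutes=window_minutes)
    counts = defaultdict(int)
    for device, timestamp, value in readings:
        if timestamp >= cutoff and value > threshold:
            counts[device] += 1
    return dict(counts)


now = datetime(2026, 5, 1, 12, 0)
readings = [
    ("sensor-a", now - timedelta(minutes=3), 130.0),   # recent and high
    ("sensor-a", now - timedelta(minutes=40), 150.0),  # high but too old
    ("sensor-b", now - timedelta(minutes=5), 42.0),    # recent but normal
    ("sensor-a", now - timedelta(minutes=1), 101.5),   # recent and high
]

print(recent_anomalies(readings, now))
```

The query is a time filter, a value filter, and a group-by. An engine built for time-series data is designed to run exactly this pattern over a constantly growing stream.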

Crucially, Eventhouse data also lives in OneLake, meaning historical event data can be joined with batch data from the Lakehouse or Warehouse without a separate data movement step.

📖 Read more: Real-Time Intelligence in Microsoft Fabric

Data Science and Machine Learning

Fabric's Data Science experience covers the full ML lifecycle — from exploratory analysis through model training, evaluation, and deployment.

The primary workspace is Jupyter-style notebooks backed by managed Spark, with access to the full Python ML ecosystem (scikit-learn, XGBoost, PyTorch, TensorFlow) and SynapseML for distributed ML on Spark.

Fabric integrates MLflow natively for experiment tracking and model registration. Models can be used for batch scoring directly against Lakehouse tables using the PREDICT function in Spark SQL — no separate serving infrastructure required for batch inference.

The deeper value: feature tables built by data engineers in the Lakehouse are immediately accessible in ML notebooks without copying or re-ingesting data. The gap between data engineering and data science shrinks considerably when both are working against the same underlying tables.

📖 Read more: Data Science in Microsoft Fabric

Security and Governance: Built In

One of the more understated strengths of Fabric's unified architecture is what it enables for governance. When all your data lives in one place, you define access policies once — not once per service.

Fabric integrates with Microsoft Entra ID for identity and access management, and with Microsoft Purview for data cataloging, lineage tracking, and sensitivity labeling. Row-level security, column-level security, and workspace-level access controls are applied uniformly across all Fabric experiences.

A sensitivity label applied to a table in the Lakehouse is respected when that same table is queried from the Warehouse or visualized in Power BI, a significant operational advantage over managing access policies across a fragmented stack.

Power BI: Reporting Without Data Duplication

Power BI is the reporting layer, and in Fabric it gains DirectLake mode, which addresses one of its longest-standing pain points.

Traditionally, Power BI reports could either:

Query live data (slow, puts load on source systems), or
Import data into an in-memory model (fast, but creates a stale copy requiring scheduled refreshes)

DirectLake is a third mode - it reads directly from Delta Parquet files in OneLake at query time, delivering import-speed performance without maintaining a separate copy of the data.

For data engineers, this removes a whole class of refresh plumbing. Once your pipeline writes a clean Delta table to the Lakehouse, a Power BI report can query it in DirectLake mode immediately: no refresh schedule, no import process, no synchronization lag.

📸 Screenshot: A Power BI report in DirectLake mode querying my Fabric Lakehouse — always current as of the last pipeline run

Bringing It All Together

The reason Fabric is worth serious evaluation is not any individual component — it's what the unified architecture enables across all of them.

A pipeline in Data Factory writes to a Lakehouse → A Spark notebook transforms it into a clean Delta table → A data scientist trains a model against that table → A warehouse analyst queries it in SQL → A Power BI report visualizes it in DirectLake mode → An Eventstream feeds real-time data into the same Lakehouse alongside batch data. Throughout all of this, Purview tracks lineage and Entra enforces access policies.

None of these steps require a separate connector, a data copy, or a cross-service authentication configuration. They are all reading from OneLake.

For teams that have spent years managing the operational overhead of a fragmented data stack, that's a genuinely meaningful shift, one where the platform handles the integration, and engineers can focus on the work that actually matters.

Try It Yourself

Microsoft Fabric Free Trial → app.fabric.microsoft.com
Full Documentation → learn.microsoft.com/fabric
OneLake Documentation → What is OneLake?
Apache Spark in Fabric → Spark overview
Real-Time Intelligence → RTI overview
Data Science in Fabric → Data Science overview

About the Author

I'm Aditya Singh Rathore, a Data Engineer passionate about building modern, scalable data platforms. I write about Microsoft Fabric, Azure data tools, and real-world data engineering on RecodeHive, breaking down complex concepts into practical, actionable content.

If this article helped you understand Microsoft Fabric better, consider sharing it with your network. And if you're building something with Fabric or just getting started, I'd love to hear about it.

🔗 Connect with me on LinkedIn | GitHub

📩 Have a topic you'd like me to cover? Drop it in the comments below.

OpenAI AgentKit: Building AI Agents Without the Complexity

rathoreadityasingh30@gmail.com (Aditya Singh Rathore) — Wed, 15 Oct 2025 00:00:00 GMT

Hey there, AI builders! 👋

I still remember the days when building an AI agent meant wrestling with fragmented tools, managing complex API calls, debugging mysterious failures, and spending more time on infrastructure than actual innovation. It felt like trying to build a house while simultaneously manufacturing your own bricks.

That changed on October 6, 2025, when Sam Altman took the stage at OpenAI's Dev Day and unveiled AgentKit - a complete toolkit that promises to transform how we build, deploy, and optimize AI agents. Today, I want to walk you through what makes AgentKit special and why it might be the most significant developer tool launch from OpenAI yet.

What is AgentKit?

AgentKit is described by OpenAI CEO Sam Altman as a comprehensive set of building blocks designed to help developers take agents from prototype to production. But that simple description doesn't do it justice.

Think of AgentKit as the unified development platform that the AI agent ecosystem has been desperately needing. Instead of piecing together multiple tools, APIs, and services from different providers, you get everything in one coherent package that actually works together.

The promise? Build, deploy, and optimize agent workflows with significantly less friction.

Why AgentKit Matters Now

Before we dive into the components, let's talk about timing. OpenAI's ChatGPT has reached 800 million weekly active users, making it one of the most widely used AI platforms in history. This massive user base represents an equally massive opportunity for developers to build AI-powered solutions.

The launch signals OpenAI's competitive move against other AI platforms racing to offer integrated tools for building autonomous agents that can perform complex tasks, not just respond to prompts. We're witnessing the shift from conversational AI to truly agentic AI - systems that can take action, use tools, and accomplish multi-step goals autonomously.

The Four Pillars of AgentKit

AgentKit isn't just one tool - it's a complete ecosystem built around four core capabilities. Let's explore each one and understand how they work together.

1. Agent Builder: The Visual Workflow Editor

Altman described Agent Builder as "like Canva for building agents" - a fast, visual way to design the logic, steps, and ideas.

This is the headline feature that's getting everyone excited, and for good reason. Remember when website builders transformed from hand-coding HTML to drag-and-drop interfaces? Agent Builder does the same thing for AI agent development.

What Agent Builder Does:

Provides a visual canvas for designing agent workflows
Uses drag-and-drop components to define agent logic
Is built on top of the Responses API that hundreds of thousands of developers already use
Eliminates the need to write boilerplate code for common agent patterns

Why This Matters: Here's the thing - even experienced developers spend a disproportionate amount of time on scaffolding and infrastructure when building agents. Agent Builder abstracts away the repetitive parts while still giving you control over the important decisions.

The Power of Visual Design: When you can see your agent's workflow as a visual graph, you can:

Spot logical errors before they become runtime bugs
Understand complex conditional flows at a glance
Iterate faster by rearranging components visually
Collaborate with non-technical stakeholders who can understand the visual representation

Think of it this way: If traditional agent development is like writing assembly code, Agent Builder is like using a modern IDE with IntelliSense, debugger, and visual tools all built in.

2. ChatKit: Embeddable Chat Interfaces Made Simple

The second pillar of AgentKit is ChatKit - and this is where things get really practical for product builders.

What ChatKit Provides: A simple embeddable chat interface that developers can use to bring chat experiences into their own apps, with the ability to bring your own brand, workflows, and whatever makes your product unique.

Why ChatKit Is Brilliant: Building a good chat interface is harder than it looks. You need to handle:

Message threading and history
Streaming responses for better UX
Error handling and retry logic
Mobile responsiveness
Accessibility features
Loading states and animations

ChatKit handles all of this out of the box, but here's the clever part - it's not a black box. You can customize it to match your brand, inject your own business logic, and integrate it seamlessly into existing applications.

The beauty is that you're not starting from scratch. You're building on a foundation that's been battle-tested by millions of users in ChatGPT.

3. Evals for Agents: Measuring What Matters

This is where AgentKit gets serious about production deployments. Anyone can build a demo that works once. Building something reliable enough to bet your business on requires rigorous evaluation.

What Evals for Agents Includes: Tools to measure AI agent performance, including step-by-step trace grading, datasets for assessing individual agent components, automated prompt optimization, and the ability to run evaluations on external models.

The Evaluation Challenge: Here's what makes evaluating AI agents tricky:

Unlike traditional software, agents are probabilistic - they might behave differently each time
Success isn't binary - there are degrees of correctness
Complex workflows have multiple failure points
Optimization in one area might break something else

How Evals for Agents Solves This:

Step-by-Step Trace Grading: Instead of just looking at final outputs, you can evaluate each step in your agent's reasoning process. This is game-changing for debugging. When something goes wrong, you can pinpoint exactly which step failed and why.

Component-Level Datasets: You can create evaluation datasets for individual components of your agent. This modular approach means you can improve specific parts without worrying about breaking the whole system.

Automated Prompt Optimization: Prompt engineering is more art than science, but it doesn't have to be. With automated optimization, you can test variations systematically and let data drive your decisions.

Cross-Model Evaluation: The ability to run evaluations on external models directly from the OpenAI platform is subtle but powerful. It means you can compare performance across different models, optimize for cost vs. quality, and make informed decisions about model selection.
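To give a feel for what step-by-step trace grading means in practice, here is a minimal Python sketch. It is not the AgentKit Evals API: the `Step` and `StepGrade` classes, the grader functions, and the example trace are all hypothetical names I made up to show the idea of grading each step and locating the first one that fails.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Step:
    name: str    # e.g. "classify_intent"
    output: str  # what the agent produced at this step


@dataclass
class StepGrade:
    name: str
    passed: bool


def grade_trace(steps: List[Step],
                graders: Dict[str, Callable[[str], bool]]) -> List[StepGrade]:
    """Grade each step with the grader registered under its name.

    Steps that have no grader are treated as passing.
    """
    grades = []
    for step in steps:
        grader = graders.get(step.name)
        passed = grader(step.output) if grader else True
        grades.append(StepGrade(step.name, passed))
    return grades


def first_failure(grades: List[StepGrade]) -> Optional[str]:
    """Return the name of the first failing step, or None if all passed."""
    for grade in grades:
        if not grade.passed:
            return grade.name
    return None


# Hypothetical trace from a support agent: the order lookup returned nothing.
trace = [
    Step("classify_intent", "refund"),
    Step("lookup_order", ""),
    Step("draft_reply", "Sorry, I could not find your order."),
]
graders = {
    "classify_intent": lambda out: out in {"refund", "billing", "shipping"},
    "lookup_order": lambda out: bool(out),
}
failing_step = first_failure(grade_trace(trace, graders))
```

Looking only at the final reply, this run might seem acceptable; grading the trace shows that the real problem sits in the lookup step.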

4. Connector Registry: Secure Integration at Scale

The fourth pillar ties everything together by solving one of the thorniest problems in enterprise AI: secure, controlled access to internal tools and external services.

What the Connector Registry Provides: Developers can securely connect agents to internal tools and third-party systems through an admin control panel while maintaining security and control.

Why This Matters for Enterprises: When I talk to enterprise developers, the same concerns come up repeatedly:

How do we give AI agents access to our systems without compromising security?
How do we audit what agents are doing with sensitive data?
How do we revoke access quickly if needed?
How do we comply with regulatory requirements?

The Connector Registry addresses all of these with a centralized, controlled approach to integrations.

The Security Model:

Centralized admin control panel for managing all connections
Granular permissions at the agent and tool level
Audit logs for compliance and debugging
Easy revocation and rotation of credentials
Support for OAuth and other enterprise authentication methods

The Developer Experience: For developers, it's beautifully simple. Instead of managing API keys in environment variables and writing custom integration code, you:

Select the connector you need from the registry
Authenticate through the admin panel
Use it in your agent with a simple reference

The platform handles the rest - credential management, retries, rate limiting, and error handling.
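The security model above boils down to three ideas: explicit grants, easy revocation, and an audit trail of every access decision. Here is a toy Python sketch of those ideas. It is purely illustrative and not how the Connector Registry is implemented or exposed; the class, method names, and the "support-bot" and "crm" identifiers are all assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ConnectorRegistry:
    # (agent, connector) pairs that are currently allowed
    grants: set = field(default_factory=set)
    # append-only record of every access decision
    audit_log: list = field(default_factory=list)

    def grant(self, agent: str, connector: str) -> None:
        self.grants.add((agent, connector))

    def revoke(self, agent: str, connector: str) -> None:
        # discard() makes revocation safe to call more than once
        self.grants.discard((agent, connector))

    def check(self, agent: str, connector: str) -> bool:
        """Decide whether the agent may use the connector, and log the decision."""
        allowed = (agent, connector) in self.grants
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "agent": agent,
            "connector": connector,
            "allowed": allowed,
        })
        return allowed


registry = ConnectorRegistry()
registry.grant("support-bot", "crm")
can_read_crm = registry.check("support-bot", "crm")          # granted
can_read_billing = registry.check("support-bot", "billing")  # never granted
registry.revoke("support-bot", "crm")
after_revoke = registry.check("support-bot", "crm")          # revoked
```

Because denied requests are logged as well as allowed ones, the audit log answers both "what did the agent do?" and "what did it try to do?".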

Seeing Is Believing: The Live Demo

One of the most compelling moments from Dev Day was when OpenAI engineer Christina Huang built an entire AI workflow and two AI agents live onstage in under eight minutes.

Let me repeat that: under eight minutes. From zero to a working multi-agent system.

This wasn't a pre-recorded demo with everything perfectly set up. This was live, unscripted development that showed what's possible when you remove unnecessary friction from the development process.

What would that same task have taken before AgentKit? Probably hours of coding, debugging, and testing. And that's if you're an experienced AI developer who knows all the APIs and best practices.

How the Components Work Together

Now that we've covered the four pillars individually, let's see how they create a unified development experience:

The Development Flow

Step 1: Design Your Agent Start in Agent Builder, visually mapping out your agent's workflow. Define the steps, decision points, and tool usage without writing any code.

Step 2: Connect Your Tools Use the Connector Registry to securely link your agent to the services it needs - databases, APIs, internal tools, whatever your use case requires.

Step 3: Add the Interface Integrate ChatKit to give your users a polished way to interact with your agent. Customize it to match your brand and product experience.

Step 4: Evaluate and Optimize Use Evals for Agents to measure performance, identify weaknesses, and systematically improve your agent's reliability.

Step 5: Deploy and Monitor Push to production with confidence, knowing you have the evaluation framework to catch issues and the tools to iterate quickly.

The Iteration Loop

Here's where the integrated approach really shines. Traditional development has a slow feedback loop:

Write code
Deploy to test environment
Manually test
Find bugs
Fix bugs
Repeat

With AgentKit, the loop is much tighter:

Adjust agent visually in Agent Builder
Run automated evals
See results immediately
Iterate based on data

This faster iteration cycle means you can explore more possibilities, validate assumptions quickly, and get to production-ready faster.

The Philosophy Behind AgentKit

Altman noted that AgentKit is "all the stuff that we wished we had when we were trying to build our first agents". This statement reveals something important about OpenAI's approach.

AgentKit wasn't designed in a vacuum by people who don't build with AI. It was designed by the same team that's been building ChatGPT, GPT-4, and other cutting-edge AI systems. They've felt the pain points, hit the roadblocks, and now they're sharing the solutions they wish they'd had.

Opinionated But Flexible

AgentKit takes strong positions on the right way to build agents:

Visual design over code-first approaches
Evaluation-driven development over manual testing
Secure, centralized integrations over scattered API keys
Component reusability over monolithic builds

But these opinions don't lock you in. Agent Builder is built on top of the Responses API that hundreds of thousands of developers already use, which means you can drop down to code when you need more control.

Production-Ready from Day One

Many developer tools focus on getting you to "hello world" quickly but leave you on your own for production concerns. AgentKit takes the opposite approach - it's designed for production from the start.

The inclusion of Evals, the Connector Registry with admin controls, and the focus on security and reliability all signal that this isn't a toy for prototypes. It's infrastructure for building real businesses on.

Who Benefits Most from AgentKit?

Individual Developers

If you're a solo developer with an idea for an AI-powered product, AgentKit dramatically lowers the barrier to entry. You don't need a team of ML engineers and DevOps specialists. You can build, evaluate, and deploy agents yourself.

Startups

For startups, AgentKit means faster time to market and lower development costs. Instead of spending months on infrastructure, you can focus on your unique value proposition and get to product-market fit faster.

Enterprise Teams

OpenAI has already signed on several launch partners that have scaled agents using AgentKit. For enterprises, the value is in the security model, evaluation framework, and ability to standardize on a single platform across teams.

Non-Technical Founders

Here's a bold prediction: AgentKit will enable non-technical founders to build AI products that would have previously required a technical co-founder. The visual nature of Agent Builder, combined with the pre-built components, puts agent development within reach of anyone willing to learn.

The Competitive Landscape

The launch highlights OpenAI's push to increase developer adoption by making agent building faster and easier, and signals a competitive move against other AI platforms racing to offer integrated tools.

The AI infrastructure space is heating up, with players like:

LangChain providing agent frameworks
AutoGen offering multi-agent systems
Anthropic's Claude with computer use
Numerous startups building agent platforms

What makes AgentKit different is the integration. While other tools focus on one piece of the puzzle, AgentKit provides the whole solution - from design to deployment to evaluation.

Best Practices for Building with AgentKit

Based on what we know about AgentKit and agent development in general, here are some principles to keep in mind:

Start Simple, Then Expand

Don't try to build a complex multi-agent system on day one. Start with a single, focused agent that does one thing well. Use Evals to make sure it's reliable, then add complexity gradually.

Evaluation-Driven Development

Make evaluation a first-class part of your development process. Create eval datasets before you build, not after. This forces you to think clearly about what success looks like.
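Here is what "create eval datasets before you build" can look like at its simplest. This is a plain Python sketch, not the AgentKit Evals tooling: the dataset format, the `run_evals` helper, and the keyword-based `keyword_agent` stand-in are all assumptions for illustration.

```python
def run_evals(agent, dataset):
    """Run `agent` over every case and return (pass_rate, failures).

    `dataset` is a list of {"input": str, "expected": str} cases, and
    `agent` is any callable that maps an input string to an output string.
    """
    failures = []
    for case in dataset:
        actual = agent(case["input"])
        if actual != case["expected"]:
            failures.append({**case, "actual": actual})
    if not dataset:
        return 0.0, failures
    pass_rate = (len(dataset) - len(failures)) / len(dataset)
    return pass_rate, failures


# Written first: this dataset is the definition of success.
dataset = [
    {"input": "I want a refund", "expected": "billing"},
    {"input": "I forgot my password", "expected": "account"},
    {"input": "Where is my parcel?", "expected": "shipping"},
    {"input": "Hello", "expected": "general"},
]


def keyword_agent(text):
    """Deliberately naive first version of the agent (keyword rules only)."""
    text = text.lower()
    if "refund" in text:
        return "billing"
    if "password" in text:
        return "account"
    return "general"


pass_rate, failures = run_evals(keyword_agent, dataset)
```

The failing shipping case shows up immediately, before any user sees it, which is exactly the feedback you want while the agent is still cheap to change.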

Embrace the Visual Paradigm

If you're a code-first developer, give the visual builder a real chance. It might feel awkward at first, but the benefits of being able to see your agent's logic at a glance are substantial.

Security First

Use the Connector Registry's admin controls from the start. Don't cut corners on security even in development. It's much harder to add security later than to build it in from the beginning.

Iterate Based on Real Usage

Deploy early (to a small audience) and let real usage guide your improvements. The evaluation tools will help you identify where your agent is struggling with actual user queries.

The Future of Agent Development

AgentKit represents a bet on the future of software development. OpenAI is betting that:

Agents will be everywhere - Not just chatbots, but agents handling complex workflows across industries
Visual tools will dominate - The future of development is more visual, more accessible, and less code-heavy
Evaluation matters - As agents become critical infrastructure, systematic evaluation becomes non-negotiable
Integration is key - The value is in connecting AI to your existing tools and data, not just in the AI itself

I think they're right on all counts.

Challenges and Considerations

Of course, no tool is perfect. Here are some things to keep in mind:

Vendor Lock-In

Building on AgentKit means building on OpenAI's platform. While you can run evaluations on external models, you're still deeply integrated with OpenAI's ecosystem. Make sure you're comfortable with that dependency.

Learning Curve

While AgentKit aims to make agent development easier, there's still a learning curve. Understanding how to design effective agent workflows, write good evaluation criteria, and optimize for production takes time and practice.

Cost Considerations

Using AI at scale isn't free. Make sure you understand the pricing model and factor in API costs when planning your application.
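A back-of-the-envelope estimate is usually enough to start with. The Python sketch below assumes token-based pricing quoted per million tokens; the prices and traffic numbers are placeholders I invented, not OpenAI's actual rates, so substitute the current figures from the pricing page.

```python
def estimate_monthly_cost(requests_per_day, input_tokens, output_tokens,
                          input_price_per_million, output_price_per_million,
                          days=30):
    """Estimate monthly spend for an agent under per-token pricing.

    Token counts are averages per request; prices are in dollars
    per one million tokens.
    """
    # Cost of one request, still scaled by one million.
    per_request_scaled = (input_tokens * input_price_per_million
                          + output_tokens * output_price_per_million)
    total_requests = requests_per_day * days
    return total_requests * per_request_scaled / 1_000_000


# Placeholder numbers only: 1,000 requests a day, 2,000 input and
# 500 output tokens per request, at $1.00 and $4.00 per million tokens.
monthly = estimate_monthly_cost(1000, 2000, 500, 1.00, 4.00)
```

Keep in mind that multi-step agents often make several model calls per user request, so the per-request token averages should cover the whole workflow, not a single call.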

Limits of Automation

Agent Builder is powerful, but it can't replace deep thinking about your problem domain. You still need to understand your users, design good workflows, and make strategic decisions.

Getting Started

Ready to dive in? Here's how to get started with AgentKit:

Explore the Documentation - OpenAI's documentation is comprehensive and includes tutorials for common use cases
Start with Templates - Don't build from scratch if you don't have to. Start with templates and modify them for your needs
Join the Community - Connect with other developers building with AgentKit. Share patterns, ask questions, and learn from others here: https://community.openai.com/
Build in Public - Share your progress and learnings. The community grows stronger when we share knowledge

Conclusion: The Agent Era Begins

AgentKit isn't just another developer tool - it's OpenAI's vision for how AI agent development should work. By removing friction, providing integrated tools, and making evaluation a first-class concern, AgentKit makes it possible for far more people to build production-grade AI agents.

Altman's statement that this is "all the stuff we wished we had when we were trying to build our first agents" resonates because it comes from real experience. This isn't theoretical - it's battle-tested approaches packaged for everyone.

Whether you're a seasoned AI developer looking to build faster, a startup trying to find product-market fit, or an enterprise scaling AI across your organization, AgentKit provides the foundation you need.

The question isn't whether agents will transform how we build software - they already are. The question is whether you'll be part of that transformation. With AgentKit, the barrier to entry has never been lower.

The future of software is agentic, and AgentKit is your toolkit for building it. The only question left is: what will you build? 🚀

GitHub Copilot CLI: Public Preview

sanjay@recodehive.com (Sanjay Viswanthan) — Wed, 17 Sep 2025 00:00:00 GMT

GitHub Copilot CLI is now in public preview. GitHub has brought the power of the GitHub Copilot coding agent directly to your terminal. With GitHub Copilot CLI, you can work locally and synchronously with an AI agent that understands your code and GitHub context in depth.

📖 Overview

GitHub Copilot CLI is now in public preview, and it’s designed to bring AI-powered development right to your command line. You can work locally and synchronously with an AI agent that understands your code and GitHub context, with no IDE switching required.

✨Key features:

✅Terminal-native dev – Use the Copilot coding agent directly in your terminal.
✅GitHub integration – Work with repositories, issues, and pull requests using natural language.
✅Agentic capabilities – Build, edit, debug, and refactor code with AI.
✅MCP-powered extensibility – Extend with custom MCP servers.
✅Full control – Every action requires your explicit approval.

In short, Copilot CLI is agent-powered and GitHub-native: you execute coding tasks with an agent that knows your repositories, issues, and pull requests, all natively in your terminal. You can also extend its capabilities and context through custom MCP servers.

📦 Getting Started

Supported Platforms

✅Linux
✅macOS
✅Windows (experimental)

Prerequisites

⚙️Node.js v22+
⚙️npm v10+
⚙️PowerShell v6+ (Windows only)
⚙️Active GitHub Copilot subscription (Pro, Pro+, Business, or Enterprise)

On Windows, you can install the latest version of PowerShell with the first command below, then check the version with the second. As noted above, it should be v6 or later.

winget install Microsoft.PowerShell

pwsh --version
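If you want to script the prerequisite check, the version requirements above can be sketched in Python like this. The helper names are my own, and the version strings are example values standing in for the output of node --version, npm --version, and pwsh --version; the exact output format of those commands on your machine is an assumption.

```python
import re


def major_version(version_text):
    """Extract the major version from text like 'v22.3.0' or 'PowerShell 7.4.1'."""
    match = re.search(r"(\d+)(?:\.\d+)*", version_text)
    if match is None:
        raise ValueError(f"no version number found in {version_text!r}")
    return int(match.group(1))


def unmet_prerequisites(reported, minimums):
    """Return the tools that are missing or below their minimum major version.

    `reported` maps tool name -> version string as printed by the tool.
    `minimums` maps tool name -> required major version.
    """
    return sorted(
        tool for tool, minimum in minimums.items()
        if tool not in reported or major_version(reported[tool]) < minimum
    )


# Minimums from the prerequisites list (pwsh applies to Windows only).
minimums = {"node": 22, "npm": 10}
example = {"node": "v20.11.0", "npm": "10.9.2"}
problems = unmet_prerequisites(example, minimums)
```

On Windows you would add "pwsh": 6 to the minimums and the pwsh --version output to the reported versions.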

If you have access to GitHub Copilot via your organization or enterprise, you cannot use GitHub Copilot CLI if your organization owner or enterprise administrator has disabled it in the organization or enterprise settings. See Managing policies and features for GitHub Copilot in your organization for more information.

💽 Installation

Copilot CLI is powered by the same agentic harness as GitHub's Copilot coding agent, so it provides intelligent assistance while staying deeply integrated with your GitHub workflow. Install it globally with npm by entering the following at the command line:

npm install -g @github/copilot

Verify the installation: the command below displays the GitHub Copilot startup banner.

copilot --banner

Authenticate with your GitHub account: If you're not currently logged in to GitHub, you'll be prompted to use the /login slash command. Enter this command and follow the on-screen instructions to authenticate.

/login

Or authenticate using a Personal Access Token (PAT):

You can also authenticate using a fine-grained PAT with the "Copilot Requests" permission enabled:

Visit https://github.com/settings/personal-access-tokens/new
Under "Permissions," click "add permissions" and select "Copilot Requests"
Generate your token
Add the token to your environment via the environment variable GH_TOKEN or GITHUB_TOKEN 👇🏻

# Linux/macOS
export GH_TOKEN=your_token_here  

# Windows
setx GH_TOKEN your_token_here
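Before launching the CLI from a script, it can help to confirm that one of the two token variables is actually set. Here is a small Python sketch of that check. The function name is hypothetical, and the order in which the two variables are checked is my assumption, not documented behavior of the CLI.

```python
import os


def find_copilot_token(env=None):
    """Return (variable_name, token) for the first token variable that is set.

    Checks GH_TOKEN, then GITHUB_TOKEN (order assumed for this sketch).
    Returns None when neither holds a non-empty value.
    """
    env = os.environ if env is None else env
    for name in ("GH_TOKEN", "GITHUB_TOKEN"):
        value = env.get(name, "").strip()
        if value:
            return name, value
    return None


found = find_copilot_token()
if found is None:
    print("No token found: set GH_TOKEN or GITHUB_TOKEN, or use /login.")
else:
    # Print only the variable name, never the token itself.
    print(f"Using token from {found[0]}")
```

Passing a plain dictionary as `env` makes the function easy to test without touching your real environment.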

🖥️ Usage

Once installed, run copilot in your terminal. Usage is straightforward: use the arrow keys to navigate the prompts and choose whether to proceed with or cancel each action.

Each time you submit a prompt to GitHub Copilot CLI, your monthly quota of premium requests is reduced by one. For information about premium requests, see https://docs.github.com/en/copilot/concepts/billing/copilot-requests

Launch Copilot CLI in a project folder:

copilot

By default, it runs Claude Sonnet 4. To switch to GPT-5:

# Linux/macOS
COPILOT_MODEL=gpt-5 copilot

# Windows (Command Prompt)
set COPILOT_MODEL=gpt-5
copilot

Version checking and Exit CLI

copilot --version

Exit anytime with:

Ctrl + C (twice)

Get Suggestions for Common Dev Tasks

Now let's get started with development. Fork the Website repo, activate the GitHub CLI, and enter the bash commands below.

List of all commands in CLI

I have linked the official repo, where you can log bugs or open a PR directly: GitHub CLI repo and Official Documentation

alias api attestation auth browse cache co codespace completion config extension gist gpg-key issue label org pr preview project release repo ruleset run search secret ssh-key status variable workflow

To run the preview, enter the following commands. 👇🏻

Documentation

gh copilot suggest "create new documentation page in docusaurus"
gh copilot suggest "organize documentation with sidebars"
gh copilot suggest "create code of conduct for repository"

Git Workflow

gh copilot suggest "create feature branch for new blog post"
gh copilot suggest "commit changes to blog content"
gh copilot suggest "create pull request for documentation updates"

Repository Maintenance

gh copilot suggest "check repository status and pending changes"
gh copilot suggest "merge feature branch after review"

Testing & Quality

gh copilot suggest "run linting checks on typescript files"
gh copilot suggest "fix build errors in docusaurus"

Package Management

gh copilot suggest "update docusaurus to latest version"

Development

gh copilot suggest "start development server for docusaurus"
gh copilot suggest "build docusaurus site for production"
gh copilot suggest "deploy docusaurus site"

SEO and metadata

gh copilot suggest "optimize SEO for docusaurus website"
gh copilot suggest "add metadata to blog posts"

✅ Final Verdict

GitHub Copilot CLI is the next step in developer productivity, bringing AI assistance natively to your terminal. With support for repositories, workflows, testing, and documentation, it simplifies development without taking control away from you.

Less setup, more shipping.

N8N: The Future of Workflow Automation

rathoreadityasingh30@gmail.com (Aditya Singh Rathore) — Wed, 17 Sep 2025 00:00:00 GMT

Hey automation enthusiasts! 🤖

I still remember the moment when I first connected OpenAI's GPT to a Google Sheets workflow in N8N. What started as a simple data processing task suddenly became an intelligent system that could analyze customer feedback, categorize it by sentiment, and automatically generate personalized responses. It was like watching automation evolve from basic "if-this-then-that" logic to something that could actually think.

Today, I want to take you through the fascinating world of N8N AI workflows - how they work, why they're game-changing, and how you can build your own intelligent automation systems that would have seemed like magic just a few years ago.

What is N8N AI Automation?

N8N (pronounced "n-eight-n") is a powerful workflow automation tool that's taken the integration world by storm. But when you add AI capabilities into the mix, something beautiful happens - your workflows stop being simple data pipelines and start becoming intelligent decision-making systems.

Think of traditional automation as a skilled assembly line worker: fast, reliable, but limited to predefined tasks. N8N AI workflows are more like having a smart assistant who can read, understand, analyze, and make contextual decisions while still maintaining the speed and reliability of automation.

The magic lies in combining N8N's visual workflow builder with AI services like OpenAI, Google's AI Platform, or even custom machine learning models to create workflows that can:

Understand natural language
Make complex decisions based on context
Generate human-like responses
Analyze patterns in data
Adapt to new situations

The Architecture: Visual Workflows Meet AI Intelligence

When you look at an N8N AI workflow, you're seeing a visual representation of an intelligent automation pipeline. Let's break down the key components:

1. Trigger Nodes: The Starting Point

Every N8N workflow begins with a trigger - the event that sets everything in motion:

Webhook Triggers:

HTTP requests from external applications
Perfect for real-time integrations
Can receive data from forms, apps, or other systems

Schedule Triggers:

Time-based automation (cron jobs made visual)
Great for periodic data processing
Can run daily reports, weekly summaries, etc.

App Triggers:

Direct integration with services (Gmail, Slack, Salesforce)
Event-driven automation (new email, message, record created)
Real-time responsiveness to external changes

Manual Triggers:

On-demand execution
Perfect for testing and ad-hoc processing

2. Data Processing Nodes: The Workhorses

These nodes handle the heavy lifting of data transformation and routing:

HTTP Request Nodes:

Connect to any REST API
Fetch data from external services
Send processed results to other systems

Function Nodes:

Custom JavaScript execution
Complex data manipulation
Custom business logic implementation

Conditional Logic Nodes:

IF/THEN/ELSE branching
Route data based on conditions
Create intelligent decision trees

Data Transformation Nodes:

Filter, sort, and reshape data
Extract specific fields
Combine data from multiple sources
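To show what the conditional and transformation nodes do to the data flowing through them, here is a small Python sketch. It is only a model of the behavior: in N8N you configure these nodes visually (and Function nodes run JavaScript, as noted above), and the `if_node` and `pick_fields` helpers plus the order data are hypothetical.

```python
def if_node(items, condition):
    """Split items into (true_branch, false_branch), like an IF node."""
    true_branch, false_branch = [], []
    for item in items:
        (true_branch if condition(item) else false_branch).append(item)
    return true_branch, false_branch


def pick_fields(items, fields):
    """Keep only the named fields on each item, like a field-extraction step."""
    return [{f: item[f] for f in fields if f in item} for item in items]


# Hypothetical incoming items, e.g. from a webhook trigger.
orders = [
    {"id": 1, "customer": "Ana", "amount": 250, "notes": "gift wrap"},
    {"id": 2, "customer": "Ben", "amount": 40, "notes": ""},
    {"id": 3, "customer": "Cy", "amount": 120, "notes": "call first"},
]

# Route high-value orders down one branch and everything else down another.
high_value, standard = if_node(orders, lambda o: o["amount"] > 100)

# Reshape the high-value branch before handing it to the next node.
summary = pick_fields(high_value, ["id", "amount"])
```

Each node receives a list of items and passes a list of items on, which is why these small steps compose so easily into longer workflows.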

3. AI Integration Nodes: The Intelligence Layer

This is where the magic happens - nodes that bring artificial intelligence into your workflows:

OpenAI Nodes:

GPT for text generation and analysis
DALL-E for image generation
Embeddings for semantic search
Fine-tuned models for specific tasks

Google AI Nodes:

Natural Language Processing
Translation services
Vision AI for image analysis
AutoML integration

Anthropic Claude Nodes:

Advanced reasoning and analysis
Long-form content generation
Code assistance and review

Custom AI Model Nodes:

Integration with your own ML models
TensorFlow and PyTorch model serving
Edge AI deployment

4. Output Nodes: The Final Destination

Where your processed, AI-enhanced data ends up:

Database Nodes:

Store results in PostgreSQL, MySQL, MongoDB
Build intelligent data lakes
Create audit trails

Notification Nodes:

Send Slack messages, emails, SMS
Create intelligent alerting systems
Deliver personalized communications

File System Nodes:

Generate reports, documents, images
Store processed data files
Create automated deliverables

How AI Transforms Traditional Workflows

Let me show you the difference between traditional automation and AI-powered workflows with a real example:

Traditional Workflow: Simple Customer Support Ticket Routing

New Email → Extract Sender → Check Department → Forward to Team → Done

This works, but it's rigid. What if the email is about multiple departments? What if the subject line is unclear?

AI-Enhanced Workflow: Intelligent Customer Support

New Email → AI Analysis (Extract Intent, Sentiment, Urgency) → 
Smart Routing (Consider Context, History, Workload) → 
Generate Response Draft → Human Review → Send Personalized Response

The AI version can:

Understand the actual meaning behind customer messages
Consider emotional context (frustrated vs. curious customers)
Route based on content, not just keywords
Generate contextual response drafts
Learn from previous interactions
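Here is a rough Python sketch of the AI-enhanced flow, so you can see where the intelligence plugs in. The `analyze` function is a keyword-based stand-in for the AI analysis step, not a real model call, and the queue names, the workload limit, and the escalation rule are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Analysis:
    intent: str
    sentiment: str
    urgent: bool


def analyze(email_text):
    """Stand-in for the AI analysis step (simple keyword rules, NOT a model)."""
    text = email_text.lower()
    if "invoice" in text or "charge" in text:
        intent = "billing"
    elif "error" in text or "crash" in text:
        intent = "technical"
    else:
        intent = "general"
    negative_words = ("angry", "unacceptable", "frustrated")
    sentiment = "negative" if any(w in text for w in negative_words) else "neutral"
    urgent = "urgent" in text or "asap" in text
    return Analysis(intent, sentiment, urgent)


def route(analysis, workload, max_open=20):
    """Pick a queue using content, emotional context, and team workload."""
    if analysis.urgent and analysis.sentiment == "negative":
        return "escalations"
    team = analysis.intent
    if workload.get(team, 0) >= max_open:
        return "overflow"  # the matching team is saturated
    return team


angry = analyze("I was charged twice, this is unacceptable and urgent")
calm = analyze("The app shows an error on login")
angry_queue = route(angry, {"billing": 2})
calm_queue = route(calm, {"technical": 3})
busy_queue = route(calm, {"technical": 25})
```

In a real workflow you would swap `analyze` for an AI node and keep `route` as ordinary, testable logic; keeping the human review step from the flow above between draft and send is still a good idea.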

Core AI Workflow Patterns

After building dozens of AI workflows, I've identified several powerful patterns that you can adapt for almost any use case:

1. The Content Intelligence Pipeline

Use Case: Automatically process and categorize incoming content

Flow Structure:

Content Trigger → AI Content Analysis → Categorization → 
Sentiment Analysis → Keyword Extraction → Storage + Routing

Real-World Applications:

Social media monitoring and response
Customer feedback processing
Content moderation and filtering
News article categorization

2. The Decision Intelligence Framework

Use Case: Make complex decisions based on multiple data sources

Flow Structure:

Data Collection → AI Analysis → Risk Assessment → 
Decision Matrix → Automated Action + Human Notification

Real-World Applications:

Loan approval workflows
Inventory restocking decisions
Quality control assessment
Investment recommendations
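
The "Decision Matrix → Automated Action + Human Notification" step can be sketched as a weighted score with two thresholds. This is a simplified illustration, not a production credit model: the criteria names, weights, and threshold values below are all assumptions chosen for the example.

```python
def decide(scores: dict[str, float], weights: dict[str, float],
           approve_at: float = 0.75, reject_at: float = 0.40) -> str:
    """Combine per-criterion scores (0..1) into one weighted score and pick an action."""
    total_weight = sum(weights.values())
    weighted = sum(scores[name] * w for name, w in weights.items()) / total_weight
    if weighted >= approve_at:
        return "auto-approve"
    if weighted <= reject_at:
        return "auto-reject"
    return "human-review"   # the uncertain middle band goes to a person

# Hypothetical criteria and weights for a loan workflow
weights = {"credit_history": 0.5, "income_stability": 0.3, "debt_ratio": 0.2}
strong = {"credit_history": 0.9, "income_stability": 0.8, "debt_ratio": 0.7}
borderline = {"credit_history": 0.6, "income_stability": 0.5, "debt_ratio": 0.5}

print(decide(strong, weights))      # auto-approve
print(decide(borderline, weights))  # human-review
```

Keeping a middle band that always goes to a human is the design choice that matters here: the automation handles the clear cases and the notification path handles the rest.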

3. The Communication Intelligence System

Use Case: Generate and personalize communications at scale

Flow Structure:

Trigger Event → Context Gathering → AI Content Generation → 
Personalization → Multi-Channel Delivery → Response Tracking

Real-World Applications:

Personalized marketing campaigns
Customer onboarding sequences
Support ticket responses
Sales follow-up automation

4. The Data Intelligence Engine

Use Case: Extract insights and patterns from large datasets

Flow Structure:

Data Ingestion → AI Analysis → Pattern Recognition → 
Insight Generation → Visualization → Action Recommendations

Real-World Applications:

Sales trend analysis
Customer behavior prediction
Operational efficiency optimization
Risk pattern detection
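
As a small illustration of the "Pattern Recognition" step, here is a sketch that flags unusual values with a z-score. A real workflow might use a model instead; the sample figures and the threshold of 2.0 are assumptions for the example.

```python
from statistics import mean, stdev

def flag_anomalies(values: list[float], threshold: float = 2.0) -> list[int]:
    """Return the indexes whose z-score magnitude exceeds the threshold."""
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []   # all values identical, nothing stands out
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]

# Hypothetical daily sales figures with one obvious spike
daily_sales = [100, 102, 98, 101, 99, 103, 97, 100, 180, 101]
print(flag_anomalies(daily_sales))  # [8]
```

The flagged indexes are what the later "Insight Generation" and "Action Recommendations" steps would consume.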

Real-World Use Cases and Success Stories

Here are some powerful AI workflows I've seen in production:

1. E-commerce Intelligence Platform

Challenge: Online store receiving thousands of product reviews daily

Solution: AI workflow that analyzes reviews, extracts insights, and automatically updates product descriptions

Results:

95% reduction in manual review processing time
40% improvement in product page conversion rates
Automatic identification of product issues before they become major problems

2. HR Recruitment Automation

Challenge: Screening hundreds of resumes for multiple positions

Solution: AI workflow that analyzes resumes, matches them to job requirements, and generates personalized outreach

Results:

80% reduction in initial screening time
60% improvement in candidate-job fit quality
Personalized communication that increased response rates by 45%

3. Financial Risk Assessment

Challenge: Manually reviewing loan applications across multiple criteria

Solution: AI workflow that combines financial data analysis with behavioral pattern recognition

Results:

70% faster decision-making process
25% improvement in risk prediction accuracy
Consistent evaluation criteria across all applications

4. Content Marketing Automation

Challenge: Creating personalized content for different audience segments

Solution: AI workflow that analyzes audience data and generates tailored content automatically

Results:

10x increase in content production capacity
35% improvement in engagement rates
Consistent brand voice across all generated content

The Integration Ecosystem: N8N's Superpower

What makes N8N AI workflows truly powerful is the vast ecosystem of integrations available:

Popular Service Integrations:

Communication Platforms:

Slack, Discord, Microsoft Teams
Email (Gmail, Outlook, SendGrid)
SMS (Twilio, Amazon SNS)

Data Stores:

Google Sheets, Airtable
Databases (PostgreSQL, MySQL, MongoDB)
Cloud Storage (Google Drive, Dropbox, AWS S3)

Business Applications:

CRM (Salesforce, HubSpot, Pipedrive)
Project Management (Notion, Asana, Jira)
E-commerce (Shopify, WooCommerce)

AI and ML Services:

OpenAI (GPT, DALL-E, Whisper)
Google AI (Vision, Language, Translation)
AWS AI (Comprehend, Rekognition, Textract)
Custom ML models via API

Creating Intelligent Integration Chains:

Salesforce Lead → AI Qualification → Google Sheets Update → 
Slack Notification → Email Sequence → Calendar Booking → 
Follow-up Automation

Each step can be enhanced with AI intelligence, creating a seamless experience that feels magical to end users.

Future Trends: Where AI Workflows Are Heading

The world of AI automation is evolving rapidly. Here are the trends I'm watching:

1. Multi-Modal AI Workflows

Workflows that can process text, images, audio, and video in the same pipeline:

Voice Input → Speech-to-Text → Intent Analysis → 
Image Processing → Decision Making → Multi-Format Response

2. Autonomous Workflow Optimization

AI systems that can optimize their own workflows:

Automatically adjust parameters based on performance
Suggest new integration opportunities
Identify bottlenecks and propose solutions

3. Collaborative AI Workflows

Multiple AI agents working together within a single workflow:

Specialist AIs for different domains
Consensus-building among AI models
Dynamic role assignment based on task requirements

4. Edge AI Integration

Running AI models directly within N8N workflows:

Reduced latency and costs
Enhanced privacy and data security
Offline operation capabilities

Getting Started: Your AI Workflow Journey

Ready to build your first AI workflow? Here's your roadmap:

Phase 1: Foundation Building (Week 1-2)

Set up N8N (self-hosted or cloud)
Create your first simple workflow without AI
Learn the basic nodes and flow patterns
Connect to your most-used services

Phase 2: AI Integration (Week 3-4)

Add your first AI node (start with OpenAI)
Build a simple text analysis workflow
Experiment with different prompts and parameters
Learn prompt engineering basics

Phase 3: Advanced Patterns (Month 2)

Implement conditional logic based on AI results
Create multi-step AI processing workflows
Add error handling and fallback logic
Optimize for performance and cost
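
For the "error handling and fallback logic" item, here is one possible shape in plain Python: retry the AI step a few times with exponential backoff, then fall back to a cheap rule-based default. The function names are hypothetical and this is not how n8n implements retries internally; it only illustrates the pattern.

```python
import time

def call_with_fallback(primary, fallback, retries: int = 2, delay: float = 0.0):
    """Try `primary` up to `retries + 1` times, then return `fallback()` instead."""
    for attempt in range(retries + 1):
        try:
            return primary()
        except Exception:
            if attempt < retries:
                time.sleep(delay * (2 ** attempt))  # exponential backoff between attempts
    return fallback()

calls = {"n": 0}

def flaky_ai_step():
    """Hypothetical AI call that always times out, to exercise the fallback path."""
    calls["n"] += 1
    raise TimeoutError("model timed out")

def rule_based_default():
    return "general-support"

result = call_with_fallback(flaky_ai_step, rule_based_default)
print(result, calls["n"])  # general-support 3
```

Capping the retries also caps the cost of a failing step, which matters when every attempt is a paid API call.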

Phase 4: Production Deployment (Month 3)

Monitor and log workflow performance
Implement proper security measures
Create comprehensive documentation
Train your team on workflow management

Resources to Accelerate Your Learning:

Documentation and Tutorials:

N8N official documentation and community forum
AI service provider documentation (OpenAI, Google AI, etc.)
Workflow template galleries and examples

Community and Support:

N8N Discord community
GitHub repositories with example workflows
YouTube tutorials and case studies

Best Practice Guides:

Security considerations for API keys and sensitive data
Performance optimization techniques
Cost management strategies

Conclusion: The Future is Intelligent Automation

AI workflows in N8N represent a fundamental shift in how we think about automation. We're moving from rigid, rule-based systems to intelligent, adaptive processes that can understand context, make decisions, and learn from experience.

The beauty of this technology lies not just in its technical capabilities, but in how it democratizes artificial intelligence. You don't need to be a data scientist or machine learning engineer to build sophisticated AI systems. With N8N's visual interface and the growing ecosystem of AI services, anyone can create intelligent automation that would have required a team of specialists just a few years ago.

Whether you're automating customer service, processing business data, generating content, or solving domain-specific challenges, AI workflows give you the power to build systems that are not just fast and reliable, but genuinely intelligent.

The future belongs to organizations that can seamlessly blend human creativity with artificial intelligence, and N8N AI workflows are the bridge that makes this possible. So start small, experiment freely, and prepare to be amazed by what you can build when you combine the power of automation with the intelligence of AI.

The next time someone asks you about the future of automation, show them an N8N AI workflow in action. Watch their expression change from skepticism to wonder as they realize we're not just talking about the future anymore - we're living in it. Happy automating!

Spark Architecture Explained

rathoreadityasingh30@gmail.com (Aditya Singh Rathore) — Fri, 22 Aug 2025 00:00:00 GMT

Hey there, fellow data enthusiasts! 👋

I remember the first time I encountered a Spark architecture diagram. It looked like a complex web of boxes and arrows that seemed to communicate in some secret distributed computing language. But once I understood what each component actually does and how they work together, everything clicked into place.

Today, I want to walk you through Spark's architecture in a way that I wish someone had explained it to me back then - focusing on the core components and how this beautiful system actually works under the hood.

What is Apache Spark?

Before diving into the architecture, let's establish what we're dealing with. Apache Spark is an open-source, distributed computing framework designed to process massive datasets across clusters of computers. Think of it as a coordinator that can take your data processing job and intelligently distribute it across multiple machines to get the work done faster.

The key insight that makes Spark special? It keeps data in memory between operations whenever possible, which is why it can be dramatically faster than traditional batch processing systems.

The Big Picture: High-Level Architecture

When you look at Spark's architecture, you're essentially looking at a well-orchestrated system with three main types of components working together:

Driver Program - The mastermind that coordinates everything
Cluster Manager - The resource allocator
Executors - The workers that do the actual processing

Let's break down each of these and understand how they collaborate.

Core Components Deep Dive

1. The Driver Program: Your Application's Brain

The Driver Program is where your Spark application begins and ends. When you write a Spark program and run it, you're essentially creating a driver program. Here's what makes it the brain of the operation:

What the Driver Does:

Contains your main() function and defines RDDs (Resilient Distributed Datasets) and operations on them
Converts your high-level operations into a DAG (Directed Acyclic Graph) of stages, each made up of tasks
Schedules tasks across the cluster
Coordinates with the cluster manager to get resources
Collects results from executors and returns final results

Think of it this way: If your Spark application were a restaurant, the Driver would be the head chef who takes orders (your code), breaks them down into specific cooking tasks, assigns those tasks to kitchen staff (executors), and ensures everything comes together for the final dish.

The driver runs in its own JVM (Java Virtual Machine) process and maintains all the metadata about your Spark application throughout its lifetime.

2. Cluster Manager: The Resource Referee

The Cluster Manager sits between your driver and the actual compute resources. Its job is to allocate and manage resources across the cluster. Spark is flexible and works with several cluster managers:

Standalone Cluster Manager:

Spark's built-in cluster manager
Simple to set up and understand
Great for dedicated Spark clusters

Apache YARN (Yet Another Resource Negotiator):

Hadoop's resource manager
Perfect if you're in a Hadoop ecosystem
Allows resource sharing between Spark and other Hadoop applications

Apache Mesos (deprecated since Spark 3.2):

A general-purpose cluster manager
Can handle multiple frameworks beyond just Spark
Good for mixed workload environments

Kubernetes:

The modern container orchestration platform
Increasingly popular for new deployments
Excellent for cloud-native environments

The key point: The cluster manager's job is resource allocation - it doesn't care what your application does, just how much CPU and memory it needs.

3. Executors: The Workhorses

Executors are the processes that actually run your tasks and store data for your application. Each executor runs in its own JVM process and can run multiple tasks concurrently using threads.

What Executors Do:

Execute tasks sent from the driver
Store computation results in memory or disk storage
Provide in-memory storage for cached RDDs/DataFrames
Report heartbeat and task status back to the driver

Key Characteristics:

Each executor has a fixed number of cores and amount of memory
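By default, executors are launched at the start of a Spark application and run for its entire lifetime (with dynamic allocation enabled, Spark can add and remove them as the workload changes)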
If an executor fails, Spark can launch new ones and recompute lost data

Think of executors as skilled workers in our restaurant analogy - they can handle multiple cooking tasks simultaneously and have their own workspace (memory) to store ingredients and intermediate results.
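
A quick back-of-the-envelope calculation shows why executor sizing matters. The sketch below assumes one core per task (Spark's default `spark.task.cpus` of 1); the cluster numbers are made up for illustration.

```python
import math

def task_waves(num_partitions: int, num_executors: int, cores_per_executor: int) -> tuple[int, int]:
    """Return (parallel task slots, number of waves needed to run one stage)."""
    slots = num_executors * cores_per_executor   # one task per core at a time
    waves = math.ceil(num_partitions / slots)    # tasks beyond the slots wait their turn
    return slots, waves

# Hypothetical cluster: 5 executors with 4 cores each, a stage with 200 partitions
print(task_waves(200, 5, 4))  # (20, 10)
```

With 20 slots, a 200-partition stage runs in 10 waves. Doubling the cores or halving the partition count halves the number of waves, as long as each task stays a reasonable size.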

How These Components Work Together: The Execution Flow

Now that we know the players, let's see how they orchestrate a typical Spark application:

Step 1: Application Submission

When you submit a Spark application, the driver program starts up and contacts the cluster manager requesting resources for executors.

Step 2: Resource Allocation

The cluster manager examines available resources and launches executor processes on worker nodes across the cluster.

Step 3: Task Planning

The driver analyzes your code and creates a logical execution plan. It breaks down operations into stages and tasks that can be executed in parallel.

Step 4: Task Distribution

The driver sends tasks to executors. Each task operates on a partition of data, and multiple tasks can run in parallel across different executors.

Step 5: Execution and Communication

Executors run the tasks, storing intermediate results and communicating progress back to the driver. The driver coordinates everything and handles any failures.

Step 6: Result Collection

Once all tasks complete, the driver collects results and returns the final output to your application.
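
The six steps above can be mimicked in a few lines of plain Python. This is only a toy model: a thread pool plays the role of the executors, and there is no real cluster manager. All function names are my own.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_partitions(data, n):
    """Driver side: slice the data into n roughly equal partitions."""
    return [data[i::n] for i in range(n)]

def task(partition):
    """Executor side: the work applied to a single partition."""
    return sum(x * x for x in partition)

data = list(range(1, 101))
partitions = split_into_partitions(data, 4)            # step 3: plan one task per partition
with ThreadPoolExecutor(max_workers=2) as executors:   # step 2: two "executors" are allocated
    partials = list(executors.map(task, partitions))   # steps 4-5: distribute and run tasks
total = sum(partials)                                  # step 6: the driver combines results
print(total)  # 338350
```

Four tasks share two workers here, so the tasks run in two rounds, just as a Spark stage with more partitions than task slots would.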

Understanding RDDs: The Foundation

At the heart of Spark's architecture lies the concept of Resilient Distributed Datasets (RDDs). Understanding RDDs is crucial to understanding how Spark actually works.

What makes RDDs special:

Resilient: RDDs can automatically recover from node failures. Spark remembers how each RDD was created (its lineage) and can rebuild lost partitions.

Distributed: RDD data is automatically partitioned and distributed across multiple nodes in the cluster.

Dataset: At the end of the day, it's still just a collection of your data - but with superpowers.
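
Lineage is easier to appreciate with a toy model. The class below is not Spark's implementation, just a sketch of the idea: each dataset remembers its parent and the function that produced it, so a lost partition can be rebuilt on demand.

```python
class ToyRDD:
    """Toy model of lineage: a dataset knows its parent and how it was derived."""
    def __init__(self, source=None, parent=None, fn=None):
        self.source, self.parent, self.fn = source, parent, fn
        self.cache = {}   # partition index -> computed data (can be lost)

    def map(self, f):
        return ToyRDD(parent=self, fn=lambda part: [f(x) for x in part])

    def compute(self, idx):
        if idx not in self.cache:   # lost or never computed
            if self.parent is None:
                self.cache[idx] = list(self.source[idx])              # re-read the source
            else:
                self.cache[idx] = self.fn(self.parent.compute(idx))   # replay the lineage
        return self.cache[idx]

base = ToyRDD(source=[[1, 2], [3, 4]])   # two partitions
doubled = base.map(lambda x: x * 2)

first = doubled.compute(1)
doubled.cache.clear()            # simulate losing the node that held this partition
rebuilt = doubled.compute(1)     # recomputed from lineage, no replication needed
print(first, rebuilt)            # [6, 8] [6, 8]
```

Only the lost partition is recomputed, which is why lineage-based recovery is cheaper than keeping full replicas of every intermediate dataset.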

RDD Operations: Transformations vs Actions

RDDs support two types of operations, and understanding the difference is crucial:

Transformations (Lazy):

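val data = sc.parallelize(1 to 100)             // assumes an existing SparkContext named sc
val filtered = data.filter(x => x > 10)         // keep values above 10
val mapped = filtered.map(x => (x % 3, x * 2))  // key-value pairs, which groupByKey requires
val grouped = mapped.groupByKey()               // wide transformation, still lazy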

These operations don't actually execute immediately. Spark just builds up a computation graph.

Actions (Eager):

val results = grouped.collect()  // Brings data to driver
val count = filtered.count()     // Returns number of elements
grouped.saveAsTextFile("hdfs://...")  // Saves to storage

Actions trigger the actual execution of all the transformations in the lineage.

This lazy evaluation allows Spark to optimize the entire computation pipeline before executing anything.
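
If lazy evaluation still feels abstract, this toy pipeline in plain Python shows the mechanics. It is a sketch of the idea, not Spark's API: transformations only record steps, and the `collect` action runs them in a single pass.

```python
class LazyPipeline:
    """Records transformations; nothing runs until an action is called."""
    def __init__(self, data, steps=()):
        self.data, self.steps = data, tuple(steps)

    def map(self, f):
        return LazyPipeline(self.data, self.steps + (("map", f),))

    def filter(self, p):
        return LazyPipeline(self.data, self.steps + (("filter", p),))

    def collect(self):
        """Action: run every recorded step in one pass over the data."""
        out = []
        for x in self.data:
            keep = True
            for kind, f in self.steps:
                if kind == "map":
                    x = f(x)
                elif not f(x):      # a filter step rejected this element
                    keep = False
                    break
            if keep:
                out.append(x)
        return out

evaluated = []

def noisy_double(x):
    evaluated.append(x)   # record every element that actually gets processed
    return x * 2

plan = LazyPipeline(range(1, 21)).filter(lambda x: x > 10).map(noisy_double)
print(len(evaluated))     # 0, because no action has run yet
result = plan.collect()
print(result[:3])         # [22, 24, 26]
```

Because the whole plan is known before anything runs, the filter and the map are fused into one loop, which is the same idea behind pipelining narrow transformations in a Spark stage.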

The DAG: Spark's Optimization Engine

One of Spark's most elegant features is how it converts your operations into a Directed Acyclic Graph (DAG) for optimal execution.

How DAG Optimization Works

When you chain multiple transformations together, Spark doesn't execute them immediately. Instead, it builds a DAG that represents the computation. This allows for powerful optimizations:

Pipelining: Multiple transformations that don't require data shuffling can be combined into a single stage and executed together.

Stage Boundaries: Spark creates stage boundaries at operations that require data shuffling (like groupByKey, join, or repartition).

Stages and Tasks Breakdown

Stage: A set of tasks that can all be executed without data shuffling. All tasks in a stage can run in parallel.

Task: The smallest unit of work in Spark. Each task processes one partition of data.

Wide vs Narrow Dependencies:

Narrow Dependencies: Each parent partition is used by at most one child partition, so no data has to move between nodes (like map, filter)
Wide Dependencies: A parent partition may feed many child partitions, so records must be regrouped across the cluster (like groupByKey, join)

Wide dependencies create stage boundaries because they require shuffling data across the network.
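
Here is a simplified sketch of how a chain of operations gets cut into stages at shuffle boundaries. The real DAG scheduler works on RDD dependencies rather than operation names, and the `WIDE` set below is a small illustrative subset, so treat this as a mental model only.

```python
# Operations that require a shuffle (illustrative subset, not exhaustive)
WIDE = {"groupByKey", "reduceByKey", "join", "repartition"}

def split_into_stages(ops: list[str]) -> list[list[str]]:
    """Group operations into stages; a wide operation starts a new stage."""
    stages, current = [], []
    for op in ops:
        if op in WIDE and current:
            stages.append(current)   # shuffle boundary: close the running stage
            current = []
        current.append(op)
    if current:
        stages.append(current)
    return stages

print(split_into_stages(["textFile", "flatMap", "map", "reduceByKey", "collect"]))
# [['textFile', 'flatMap', 'map'], ['reduceByKey', 'collect']]
```

Every extra wide operation adds a stage and a shuffle, which is why reducing shuffles is usually the first thing to look at when tuning a job.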

Memory Management: Where the Magic Happens

Spark's memory management is what gives it the speed advantage over traditional batch processing systems. Here's how it works:

Memory Regions

Spark divides executor memory into several regions:

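Storage Memory (about 30% of usable heap by default):

Used for caching RDDs/DataFrames
LRU eviction when space is needed
Can borrow from execution memory when available

Execution Memory (about 30% of usable heap by default):

Used for computation in shuffles, joins, sorts, aggregations
Can borrow from storage memory when needed, evicting cached blocks if necessary

User Memory (about 40% of usable heap by default):

For user data structures and internal metadata
Not managed by Spark

Reserved Memory (300MB by default):

System reserved memory for Spark's internal objects

These defaults come from the unified memory manager (Spark 2.x and later). spark.memory.fraction = 0.6 sets the combined storage and execution share of the heap left after the reserved 300MB, and spark.memory.storageFraction = 0.5 splits that share evenly between the two.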

The beautiful thing about this system is that storage and execution memory can dynamically borrow from each other based on current needs.
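
To see what the regions look like in actual megabytes, here is a small calculator. It assumes the unified memory manager defaults in Spark 2.x and later (`spark.memory.fraction` = 0.6, `spark.memory.storageFraction` = 0.5, 300MB reserved). The 4GB heap is just an example, and real executors also have off-heap overhead that this sketch ignores.

```python
RESERVED_MB = 300   # fixed reservation for Spark's internal objects

def memory_regions(heap_mb: int, memory_fraction: float = 0.6,
                   storage_fraction: float = 0.5) -> dict[str, float]:
    """Split an executor heap into regions under the unified memory model."""
    usable = heap_mb - RESERVED_MB
    unified = usable * memory_fraction    # shared by storage and execution
    storage = unified * storage_fraction  # storage share protected from eviction
    return {
        "reserved": float(RESERVED_MB),
        "storage": round(storage, 1),
        "execution": round(unified - storage, 1),
        "user": round(usable - unified, 1),
    }

regions = memory_regions(4096)   # a hypothetical 4GB executor heap
print(regions)
```

For a 4GB heap this works out to roughly 1.1GB each for storage and execution and about 1.5GB of user memory, which is often less cache space than people expect.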

The Unified Stack: Multiple APIs, One Engine

What makes Spark truly powerful is that it provides multiple high-level APIs that all run on the same core engine:

Spark Core

The foundation that provides:

Basic I/O functionality
Task scheduling and memory management
Fault tolerance
RDD abstraction

Spark SQL

SQL queries on structured data
DataFrame and Dataset APIs
Catalyst query optimizer
Integration with various data sources

Spark Streaming

Real-time stream processing
Micro-batch processing model
Integration with streaming sources like Kafka

MLlib

Distributed machine learning algorithms
Feature transformation utilities
Model evaluation and tuning

GraphX

Graph processing and analysis
Built-in graph algorithms
Graph-parallel computation

The key insight: all of these APIs compile down to the same core RDD operations, so they all benefit from Spark's optimization engine and can interoperate seamlessly.

Putting It All Together

Now that we've covered all the components, let's see how they work together in a real example:

// This creates RDDs but doesn't execute anything yet
val textFile = sc.textFile("hdfs://large-file.txt")  // sc is the SparkContext (spark.sparkContext in a SparkSession app)
val words = textFile.flatMap(line => line.split(" "))
val wordCounts = words.map(word => (word, 1))
val aggregated = wordCounts.reduceByKey(_ + _)

// This action triggers execution of the entire pipeline
val results = aggregated.collect()

What happens behind the scenes:

Driver creates a DAG with two stages (split by the reduceByKey shuffle)
Driver requests executors from cluster manager
Stage 1 tasks (read, flatMap, map) execute on partitions across executors
Data gets shuffled for the reduceByKey operation
Stage 2 tasks perform the aggregation
Results get collected back to the driver
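
The same two-stage flow can be traced by hand in plain Python. This is a toy simulation, not PySpark: lists stand in for partitions, and a hash on the word decides which reducer bucket receives it, which is the essence of a shuffle. The real reduceByKey also pre-aggregates on the map side, which this sketch leaves out.

```python
from collections import defaultdict

def stage1(partition: list[str]) -> list[tuple[str, int]]:
    """flatMap + map, pipelined in one pass over a partition."""
    return [(word, 1) for line in partition for word in line.split(" ")]

def shuffle(mapped: list[list[tuple[str, int]]], num_reducers: int) -> list[dict[str, list[int]]]:
    """Send every occurrence of a word to the same reducer bucket."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for part in mapped:
        for word, one in part:
            buckets[hash(word) % num_reducers][word].append(one)
    return buckets

def stage2(bucket: dict[str, list[int]]) -> dict[str, int]:
    """reduceByKey(_ + _): sum the counts for each word."""
    return {word: sum(ones) for word, ones in bucket.items()}

partitions = [["a b a"], ["b c", "a"]]           # two input partitions
mapped = [stage1(p) for p in partitions]         # stage 1 runs once per partition
counts = {}
for bucket in shuffle(mapped, num_reducers=2):   # shuffle, then stage 2 per bucket
    counts.update(stage2(bucket))                # the "driver" collects the results
print(sorted(counts.items()))  # [('a', 3), ('b', 2), ('c', 1)]
```

Notice that stage 2 cannot start on a word until every stage 1 partition has reported its pairs for that word. That wait is the stage boundary.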

Why This Architecture Matters

Understanding Spark's architecture isn't just academic knowledge - it's the key to working effectively with big data:

Fault Tolerance: The RDD lineage graph means Spark can recompute lost data automatically without manual intervention.

Scalability: The driver/executor model scales horizontally - just add more worker nodes to handle bigger datasets.

Efficiency: Lazy evaluation and DAG optimization mean Spark can optimize entire computation pipelines before executing anything.

Flexibility: The unified stack means you can mix SQL, streaming, and machine learning in the same application without data movement penalties.

Conclusion: The Beauty of Distributed Computing

Spark's architecture represents one of the most elegant solutions to distributed computing that I've encountered. By clearly separating concerns - coordination (driver), resource management (cluster manager), and execution (executors) - Spark creates a system that's both powerful and understandable.

The magic isn't in any single component, but in how they all work together. The driver's intelligence in creating optimal execution plans, the cluster manager's efficiency in resource allocation, and the executors' reliability in task execution combine to create something greater than the sum of its parts.

Whether you're processing terabytes of log data, training machine learning models, or running real-time analytics, understanding this architecture will help you reason about performance, debug issues, and design better data processing solutions.

The next time you see a Spark architecture diagram, I hope you'll see what I see now - not a confusing web of boxes and arrows, but an elegant dance of distributed computing components working in perfect harmony. Happy Sparking! 🚀

GitHub Copilot Coding Agent

sanjay@recodehive.com (Sanjay Viswanthan) — Fri, 04 Jul 2025 00:00:00 GMT

In the fast-evolving world of software development, AI-powered tools are changing the game. GitHub is at the forefront with its latest innovation: the GitHub Copilot Coding Agent. More than just an in-editor assistant, this powerful new agent works asynchronously to handle entire engineering tasks on its own. Let's dive into what it is, how it works, and how you can leverage it to automate your workflow.

🚀 What Is the GitHub Copilot Coding Agent?

The GitHub Copilot Coding Agent is an asynchronous software engineering agent that:

✅ Takes GitHub Issues as input.
✅ Writes code, runs tests, and creates pull requests, just like a teammate.
✅ Works inside GitHub Actions, unlike the real-time agent mode in your IDE (e.g., VS Code).

🔧 How It Works

1. Write & Assign an Issue to Copilot
When creating an issue for the GitHub Copilot Coding Agent, clarity and structure are key to getting the best results. Here’s how to craft an effective issue that sets Copilot up for success:

Provide Clear Context:
Begin by describing the problem or feature request in detail. Explain why the change is needed, referencing any relevant background, user stories, or business goals. If the issue relates to a bug, include steps to reproduce, expected vs. actual behavior, and any error messages or screenshots.
Define Expected Outcomes:
Clearly state what a successful resolution looks like. For features, you can attach an image or sketch of the expected output.
Include Technical Details:
Add any technical constraints, dependencies, or architectural considerations. Link to relevant code, documentation, or previous issues/PRs. If there are specific files, functions, or APIs involved, mention them explicitly.
Use Templates and Repo Instructions:
Leverage your repository’s issue templates to maintain consistency. Follow any contribution guidelines or coding standards documented in the repo. This ensures Copilot’s work aligns with your team’s practices.
Assign the Issue to Copilot:
Just like you would with a human teammate, assign the issue to Copilot. This triggers the agent workflow and signals that the issue is ready for automated handling.

Example Issue Template:

Summary
Briefly describe the task or bug.

Context
Explain why this change is needed. Link to related issues or documentation.

Acceptance Criteria
- [ ] List specific outcomes or deliverables
- [ ] Include test coverage or documentation updates if needed

Technical Notes
Mention files, functions, or dependencies involved.

Additional Info
Add screenshots, logs, or references as needed.

By following these steps, you ensure Copilot has all the information it needs to deliver high-quality, context-aware code changes—making your workflow smoother and more efficient.

🌟 What Happens Next?

Once you assign the issue to GitHub Copilot, the agent will analyze the requirements and begin working asynchronously. It may take a short while for Copilot to generate the code, run tests, and open a new pull request (PR) with the proposed changes.

You can expect:

A new PR created automatically by Copilot, referencing the original issue.
Automated test results and code suggestions included in the PR.
Clear traceability between your issue and the resulting code changes.

Stay engaged by reviewing the PR, providing feedback, or merging it when ready. This workflow helps you leverage automation while maintaining control over your codebase.

🧭 Earn $200 by Providing Early-Stage Feedback

💬 Share your feedback on Copilot Coding Agent for a chance to win a $200 gift card!

We’re inviting early adopters to help shape the future of the GitHub Copilot Coding Agent. Your insights are invaluable in improving the agent’s usability, reliability, and overall experience. By participating, you’ll have the opportunity to directly influence upcoming features and enhancements.

📍Note: The following feedback program was available for early adopters and may no longer be active. Please check the official GitHub blog for current opportunities.

How to participate:

Try out the Copilot Coding Agent:
Use the agent to automate coding tasks, resolve issues, or create pull requests in your repository.
Share your experience:
Provide detailed feedback on what worked well, what could be improved, and any challenges you faced. Screenshots, suggestions, and real-world use cases are especially helpful.

Why participate?

The most insightful and actionable feedback will be eligible for a $200 gift card.
Help make Copilot Coding Agent more effective for the entire developer community.
Get early access to new features and updates.

✅ Conclusion

The GitHub Copilot Coding Agent represents a significant step forward in developer productivity and workflow automation. By integrating AI-driven code generation and automated pull requests directly into your GitHub processes, you can streamline repetitive tasks and focus on higher-level problem solving. While automation accelerates development, human insight and collaboration remain essential for delivering quality software. Embrace these tools to enhance your workflow, but always keep user needs and team goals at the center of your development process.

🎥 Watch the Demo

Check out this video walkthrough of the GitHub Copilot Coding Agent in action:
