Python vs. PySpark Notebooks in Microsoft Fabric
A simple rule of thumb: for small to medium datasets (under ~100M rows), a plain Python notebook is typically faster and cheaper in CU (capacity unit) consumption. PySpark is the better choice once you're dealing with large datasets — think 100M+ rows or 10 GB+.
One thing worth keeping in mind: processing costs don't just apply to notebooks. Data Pipelines also consume CUs, and they can rack up costs quickly if you're not careful.
In a recent project, I needed to reduce a dataset to 310 million rows. Given the scale, I ran the entire pipeline in PySpark — and it was the right call.

Power BI in the Browser with Microsoft Fabric
I grew up on desktop tools — Power BI Desktop, Tableau, RStudio. But with Microsoft Fabric, I do everything in the browser, and honestly? I love it.
Notebooks, semantic models, Power BI reports — all in one place. Load the data, transform it, set up a pipeline, move it through Bronze → Silver → Gold, build the semantic model, and publish the report. No context switching. No desktop installs.
Sure, the browser isn't 100% there yet — advanced modelling and Row-Level Security still need the desktop. But we're at 80%, and that 80% covers most of the work.
One environment. One flow. That's the win.

From API to Report: Building an End-to-End Data Pipeline in Microsoft Fabric
A walkthrough of a fully automated pipeline that pulls data from a REST API, transforms it through a Lakehouse and Warehouse, and surfaces insights in a Power BI report — all orchestrated inside Fabric.

Why Fabric?
Microsoft Fabric brings together data engineering, warehousing, and BI into a single unified platform. Instead of stitching together multiple services and credentials, everything lives in one workspace — from the raw API call to the finished Power BI report.
Step 1 — Fetch data from the API (notebook)
The pipeline kicks off by setting a timestamp variable, then firing a Fabric notebook that calls the REST API. The raw JSON responses land in the Files section of the API Lakehouse, preserving the original payload for full auditability.
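A minimal sketch of what that first notebook might look like (the endpoint, key, and file path below are placeholders, not the project's actual API):

```python
import json
import os
from datetime import datetime, timezone

import requests

# Placeholders: substitute the real endpoint and a secret pulled from Key Vault
API_URL = "https://api.example.com/v1/orders"
API_KEY = "<api-key>"

# In the real pipeline this timestamp arrives as a notebook parameter
run_ts = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")

response = requests.get(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=60,
)
response.raise_for_status()

# Land the raw payload untouched in the Lakehouse Files section for auditability
os.makedirs("/lakehouse/default/Files/raw", exist_ok=True)
raw_path = f"/lakehouse/default/Files/raw/orders_{run_ts}.json"
with open(raw_path, "w", encoding="utf-8") as f:
    json.dump(response.json(), f)
```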
Step 2 — Parse JSON into a Lakehouse table (notebook)
A second notebook reads those JSON files and flattens them into a structured Delta table inside the same Lakehouse. This is where field mapping, type casting, and any light cleansing happen — keeping the transformation logic version-controlled and reproducible.
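As a rough illustration (column names and paths are made up, and the Fabric notebook's built-in spark session is assumed), the flattening step could look like this:

```python
from pyspark.sql import functions as F

# Read the raw JSON files landed by step 1 (illustrative path)
raw = spark.read.json("Files/raw/orders_*.json")

# Flatten nested fields, map names, and cast types
orders = raw.select(
    F.col("id").cast("string").alias("order_id"),
    F.col("customer.id").cast("string").alias("customer_id"),
    F.col("amount").cast("double").alias("amount"),
    F.to_date("created_at").alias("order_date"),
)

# Persist as a managed Delta table in the same Lakehouse
orders.write.mode("overwrite").format("delta").saveAsTable("orders_cleansed")
```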
Step 3 — Copy data from Lakehouse to Warehouse (copy activity)
A Copy data activity moves the cleansed rows from the Lakehouse Delta table into a staging table in the API Warehouse. The separation between Lakehouse and Warehouse keeps raw storage costs low while giving the downstream SQL layer proper query performance and access control.
Step 4 — Load the reporting table via stored procedure
A stored procedure merges the staging data into the final reporting table and updates the accompanying reporting view. Encapsulating this logic in a stored procedure means the pipeline step stays thin — and the merge logic can be tested and iterated independently.
Step 5 — Refresh the semantic model and serve the report
The final pipeline step triggers a semantic model refresh in Power BI. Once the model is updated, the report surfaces the latest data — completing the journey from raw API payload to interactive dashboard, fully automated and repeatable on any schedule.
Data Pipeline "ForEach" = Python's for loop
Microsoft Fabric's ForEach activity works just like a Python for loop — iterate over a collection, run some logic for each item, repeat.
Processing files? Feed it ["file1.csv", "file2.csv"], wrap a Copy activity inside, and use @item() to reference each file dynamically. Same mental model as looping through a list in Python.
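In plain Python terms, the equivalent would be something like:

```python
# The ForEach activity is the pipeline version of this loop
files = ["file1.csv", "file2.csv"]  # the collection you feed to ForEach

for item in files:
    # Inside the loop, "item" plays the role of @item() in the pipeline;
    # here we just print it, where the pipeline would run a Copy activity
    print(f"Copying {item} ...")
```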

Combining Medallion Architecture with CI/CD in Microsoft Fabric
Everyone agrees Medallion architecture delivers competitive advantages — faster insights, better data trust, greater scalability. But knowing why it works is only half the battle. The part that's rarely talked about is how to actually implement it.
My take: pair it with CI/CD. Map the Bronze, Silver, and Gold layers across your development, test, and production pipeline, and suddenly Medallion stops being a theoretical framework and starts being something you can actually ship.

Reading PDFs: Azure AI Document Intelligence vs. LLMs
Traditional PDF extraction tools may feel outdated compared to today's powerful LLMs — but the distinction matters. Traditional OCR-based tools like Azure AI Document Intelligence extract rather than infer, which is their key advantage: they return field-level confidence scores that make outputs fully auditable.
LLMs are harder to audit. More critically, they carry hallucination risk: under uncertainty, an LLM may confabulate a plausible-looking number to fill the gap rather than return a low confidence score or flag the ambiguity. In a high-stakes context like financial reporting, that silent failure is arguably worse than a flagged extraction failure.
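To make the contrast concrete, here is roughly how field-level confidence surfaces with the azure-ai-formrecognizer SDK (endpoint, key, and file name are placeholders):

```python
from azure.core.credentials import AzureKeyCredential
from azure.ai.formrecognizer import DocumentAnalysisClient

# Placeholders: use your own Document Intelligence endpoint and key
client = DocumentAnalysisClient(
    endpoint="https://<your-resource>.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("<your-key>"),
)

with open("invoice.pdf", "rb") as f:
    poller = client.begin_analyze_document("prebuilt-invoice", document=f)
result = poller.result()

# Every extracted field carries a confidence score you can audit or threshold
for doc in result.documents:
    for name, field in doc.fields.items():
        print(f"{name}: {field.value} (confidence={field.confidence})")
```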


Example Queries for Microsoft Fabric Data Agents
When you set up a Data Agent in Microsoft Fabric, you can give it a set of example queries — sometimes called "few-shot examples" — to help it understand how to respond to questions.
Think of them as cheat sheet entries: you provide a sample question alongside the query logic that should answer it. The agent learns from these patterns and uses them to handle similar questions on its own.
In the image below, you can see this in action:
- (1) is where you define your example queries as a creator
- (2) shows the agent's answer when that example is applied
- (3) is where it gets interesting — here I simply asked the agent to visualize the data, and it knew exactly what to do, drawing on the patterns it had learned
That last part is the magic of few-shot examples: once the agent has good patterns to work from, it can go beyond the examples themselves and handle new requests with confidence.
FICTIVE DATA - NOT REAL NAMES

UDFs (User-Defined Functions) in PySpark
Standard DataFrame operations cover most needs, but UDFs fill the gaps when you need custom business logic that can't be expressed with built-in Spark functions — things like complex string parsing, conditional transformations involving multiple rules, or calling external libraries (e.g., dateutil, re, custom validators).
Simple Example: Classifying a Customer's Spend Tier
Imagine you have a sales DataFrame and want to tag each customer with a business-defined tier label based on their total spend — logic that's too specific for a simple when/otherwise chain.
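A minimal sketch of that idea (the thresholds, sample data, and column names are made up for illustration):

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

# Illustrative sample data: one row per order
sales_df = spark.createDataFrame(
    [("C001", 1200.0), ("C001", 450.0), ("C002", 50000.0), ("C003", 80.0)],
    ["customer_id", "amount"],
)

# Business-defined tier rules (hypothetical thresholds)
def spend_tier(total_spend):
    if total_spend is None:
        return "Unknown"
    if total_spend >= 10000:
        return "Gold"
    if total_spend >= 1000:
        return "Silver"
    return "Bronze"

spend_tier_udf = F.udf(spend_tier, StringType())

tiers = (
    sales_df.groupBy("customer_id")
    .agg(F.sum("amount").alias("total_spend"))
    .withColumn("tier", spend_tier_udf(F.col("total_spend")))
)
tiers.show()
```

Worth remembering: Python UDFs run row by row outside Spark's optimizer, so keep them for logic that genuinely can't be expressed with built-in functions.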
Defining the schema explicitly
The rule of thumb: schema inference is fine for exploration, but defined schemas are essential for production pipelines. In Fabric, where pipelines run repeatedly on large datasets, defining the schema upfront is a habit worth building early.
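For example, a sketch of an explicit schema for an illustrative sales feed, assuming the Fabric notebook's built-in spark session:

```python
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, DateType

# Explicit schema: fail fast if the source drifts, instead of silently inferring
sales_schema = StructType([
    StructField("customer_id", StringType(), nullable=False),
    StructField("order_date", DateType(), nullable=True),
    StructField("amount", DoubleType(), nullable=True),
])

sales_df = (
    spark.read
    .schema(sales_schema)
    .option("header", "true")
    .csv("Files/raw/sales/*.csv")  # illustrative path
)
```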

Partitioning in PySpark
Partitioning is one of those topics that becomes very important as our data grows. Spark processes data in parallel chunks called partitions. Partitioning strategy directly affects:
- Query speed — Spark can skip irrelevant partitions entirely (called partition pruning)
- Shuffle cost — bad partitioning causes expensive data movement across nodes
- Memory pressure — too few partitions = large chunks per node, too many = overhead
Partition by columns you frequently filter on. A good partition column has low cardinality (not too many unique values) — region with 4 values is great, customer_id with millions of values would create millions of tiny files, which is harmful.
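A small sketch of both sides, writing and pruning, assuming the notebook's built-in spark session (data, paths, and column names are illustrative):

```python
# Illustrative DataFrame with a low-cardinality region column
sales_df = spark.createDataFrame(
    [("C001", "EMEA", 1200.0), ("C002", "AMER", 80.0)],
    ["customer_id", "region", "amount"],
)

# Write a Delta table partitioned by the column we filter on most
(
    sales_df.write
    .mode("overwrite")
    .format("delta")
    .partitionBy("region")
    .save("Tables/sales_by_region")
)

# A filter on the partition column lets Spark prune irrelevant partitions
df = spark.read.format("delta").load("Tables/sales_by_region")
print(df.filter(df.region == "EMEA").count())
```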

Sometimes we just need Python and Pandas
PySpark is amazing for large datasets (e.g., 10 GB+). For smaller datasets, though, plain Python is faster and cheaper, and pandas' visualization capabilities are superior to PySpark's.
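A quick illustration with a hypothetical file and column names, using pandas' built-in matplotlib-backed plotting:

```python
import pandas as pd

# Small dataset: plain pandas is quick and cheap (illustrative path)
df = pd.read_parquet("/lakehouse/default/Files/sales_small.parquet")

# One line from aggregation to chart
monthly = df.groupby("month")["amount"].sum()
monthly.plot(kind="bar", title="Monthly sales")
```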

Inside the Apache Parquet Format
Apache Parquet is a columnar storage format that offers significant advantages over traditional row-based formats like CSV or JSON. Unlike row-oriented storage, Parquet organizes data by columns, which allows analytical queries to read only the relevant columns without scanning unnecessary data. It also embeds the data schema directly in the file and enables efficient compression of repetitive values — for example, a boolean column with many repeated 0s and 1s can be compressed far more effectively than in a plain text format. Microsoft Fabric makes it easy to work with Parquet files stored in OneLake or other cloud storage solutions.
Source: aka.ms/fabricnotes
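A quick way to see the columnar advantage in practice, using pandas with an illustrative file path and column names:

```python
import pandas as pd

# Read only the columns a query needs; a row-based CSV would scan everything
df = pd.read_parquet(
    "/lakehouse/default/Files/events.parquet",  # illustrative path
    columns=["event_date", "event_type"],
)

# The schema travels with the file, so types survive the round trip
print(df.dtypes)
```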

Advantages of Managing CLS & RLS in Fabric (Not in Power BI)
Power BI's Column Level Security (CLS) and Row Level Security (RLS) were always a workaround — security defined at the reporting layer because the data source couldn't handle it properly.
With Microsoft Fabric, that excuse is gone.
Define your security rules once in the Fabric Warehouse using standard SQL. Connect Power BI via DirectQuery, and those rules are automatically enforced everywhere — in Power BI, notebooks, SQL clients, every tool. No bypass possible. No duplication. One place to maintain.
Power BI goes back to being what it should always have been: a presentation layer.
Simpler. Safer. More maintainable.

