
Franco Arda, Ph.D.

Data Engineer
Microsoft Fabric

Experience: Deutsche Bank · Daimler · Siemens · DHL · Swisscom
Tools: Microsoft Fabric (Python, SQL, Power BI, Data Agents)
Languages: Swiss German · German · English
Education: Ph.D. · MBA

 

New: Microsoft Fabric "Data Agent"
 

Data agents launched on March 15, 2026 — and they're potentially revolutionizing how we interact with company data.
 

How does it work? In short: all structured data (think numbers in tables) lives in Microsoft Fabric, and a Data Agent sits on top of it — letting you query that data in plain language, instantly.
 

Why this matters for family offices. Wealth is fragmented by nature — private equity, real estate, art, multiple custodian banks. Your clients expect you to have the full picture at any moment. With a Data Agent, you can answer questions like these in seconds:
 

  • "Which of my clients have exposure to US tech right now?"

  • "Visualize the asset allocation of customer X."

[Image: Data Agent demo. Names are fictitious.]

Why not just use a general AI like Claude or ChatGPT?
Those tools are excellent for many tasks — but they're not built for processing large volumes of enterprise transactions. At scale, they hallucinate and break down. A proper Data Agent requires data to be structured and governed at the enterprise level to work reliably.

Compliance (Switzerland)
This is where it gets important for Swiss financial firms:

 

  • Data stays in Switzerland. Only the prompt (your question) leaves the country — not the underlying data. The agent operates at schema level, meaning it receives instructions about the structure, not the data itself.
     

  • PII protection. Personally Identifiable Information in user queries can be masked before it leaves Switzerland, using one of several available masking solutions.
     

  • FINMA compliance. For Swiss financial institutions, Microsoft Fabric can be made compliant with FINMA supervisory communication 03/2024 on cyber risks — with the addition of Microsoft Purview.

Python vs. PySpark Notebooks in Microsoft Fabric


A simple rule of thumb: for small to medium datasets (under ~100M rows), plain Python is typically faster and cheaper in capacity units (CUs). PySpark is the better choice once you're dealing with large datasets (think 100M+ rows or 10 GB+).
 

One thing worth keeping in mind: processing costs don't just apply to notebooks. Data Pipelines also consume CUs, and they can rack up costs quickly if you're not careful.
 

In a recent project, I needed to reduce a dataset down to 310 million rows. Given the scale, I ran the entire pipeline in PySpark — and it was the right call.
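To make that concrete, here is a minimal sketch of the same top-10 aggregation both ways in a Fabric notebook. The paths and column names are made up; spark is the session a Fabric notebook provides.

# Small/medium data: plain Python (pandas) in a Fabric notebook.
import pandas as pd

orders = pd.read_parquet("Files/silver/orders.parquet")   # hypothetical path
top10 = orders.groupby("customer_id")["amount"].sum().nlargest(10)

# Large data (100M+ rows / 10 GB+): PySpark on the same hypothetical path.
orders_sdf = spark.read.parquet("Files/silver/orders.parquet")
top10_sdf = (
    orders_sdf.groupBy("customer_id")
    .sum("amount")
    .orderBy("sum(amount)", ascending=False)
    .limit(10)
)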

Python vs. PySpark

Power BI in the Browser with Microsoft Fabric


I grew up on desktop tools — Power BI Desktop, Tableau, RStudio. But with Microsoft Fabric, I do everything in the browser, and honestly? I love it.
 

Notebooks, semantic models, Power BI reports — all in one place. Load the data, transform it, set up a pipeline, move it through Bronze → Silver → Gold, build the semantic model, and publish the report. No context switching. No desktop installs.


Sure, the browser isn't 100% there yet — advanced modelling and Row-Level Security still need the desktop. But we're at 80%, and that 80% covers most of the work.
 

One environment. One flow. That's the win.

Power BI in Fabric

Data Pipeline "ForEach" = Python's for loop


Microsoft Fabric's ForEach activity works just like a Python for loop — iterate over a collection, run some logic for each item, repeat.
 

Processing files? Feed it ["file1.csv", "file2.csv"], wrap a Copy activity inside, and use @item() to reference each file dynamically. Same mental model as looping through a list in Python.
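For intuition, here is the same pattern in plain Python. copy_to_lakehouse is a made-up stand-in for the Copy activity, not a Fabric API.

def copy_to_lakehouse(file_name: str) -> None:
    # Stand-in for the inner Copy activity; in the pipeline, @item() resolves to file_name.
    print(f"copying {file_name} ...")

files = ["file1.csv", "file2.csv"]   # the collection the ForEach iterates over
for item in files:                   # ForEach activity
    copy_to_lakehouse(item)          # Copy activity using @item()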

[Image: Data Pipeline ForEach activity]

Combining Medallion Architecture with CI/CD in Microsoft Fabric
Everyone agrees Medallion architecture delivers competitive advantages — faster insights, better data trust, greater scalability. But knowing why it works is only half the battle. The part that's rarely talked about is how to actually implement it.

My take: pair it with CI/CD. Map the Bronze, Silver, and Gold layers across your development, test, and production pipeline, and suddenly Medallion stops being a theoretical framework and starts being something you can actually ship.

[Image: Medallion Architecture]

Reading PDFs: Azure AI Document Intelligence vs. LLMs
 

Traditional PDF extraction tools may feel outdated compared to today's powerful LLMs — but the distinction matters. Traditional OCR-based tools like Azure AI Document Intelligence extract rather than infer, which is their key advantage: they return field-level confidence scores that make outputs fully auditable.
 

LLMs are harder to audit. More critically, they carry hallucination risk: under uncertainty, an LLM may confabulate (fabricate to fill gaps) a plausible-looking number rather than return a low confidence score or flag the ambiguity. In a high-stakes context like financial reporting, that silent failure is arguably worse than a flagged extraction failure.
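As a rough sketch of what that auditability looks like in code, here is the azure-ai-formrecognizer Python SDK reading an invoice and surfacing per-field confidence. The endpoint, key, file name, and review threshold are placeholders, and exact client names vary by SDK version.

from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential

# Placeholders: use your own Document Intelligence endpoint and key.
client = DocumentAnalysisClient(
    "https://<your-resource>.cognitiveservices.azure.com/",
    AzureKeyCredential("<your-key>"),
)

with open("invoice.pdf", "rb") as f:
    poller = client.begin_analyze_document("prebuilt-invoice", document=f)
result = poller.result()

# Every extracted field carries a confidence score, so low-confidence values
# can be routed to manual review instead of silently trusted.
for doc in result.documents:
    for name, field in doc.fields.items():
        flag = "REVIEW" if (field.confidence or 0) < 0.85 else "ok"
        print(f"{name}: {field.value} (confidence={field.confidence}) {flag}")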

[Image: Azure AI Document Intelligence]

Example Queries for Microsoft Fabric Data Agents

When you set up a Data Agent in Microsoft Fabric, you can give it a set of example queries — sometimes called "few-shot examples" — to help it understand how to respond to questions.
 

Think of them as cheat sheet entries: you provide a sample question alongside the query logic that should answer it. The agent learns from these patterns and uses them to handle similar questions on its own.
 

In the image below, you can see this in action:

  • (1) is where you define your example queries as a creator

  • (2) shows the agent's answer when that example is applied

  • (3) is where it gets interesting — here I simply asked the agent to visualize the data, and it knew exactly what to do, drawing on the patterns it had learned


That last part is the magic of few-shot examples: once the agent has good patterns to work from, it can go beyond the examples themselves and handle new requests with confidence.

FICTITIOUS DATA (NOT REAL NAMES)
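For intuition, a few-shot entry is just a sample question paired with the query logic that answers it. Here is a hypothetical sketch; the table and column names are invented.

# Hypothetical few-shot examples for a Data Agent: natural-language question
# paired with the SQL that answers it. Tables and columns are invented.
example_queries = {
    "Which of my clients have exposure to US tech right now?": """
        SELECT DISTINCT c.client_name
        FROM holdings h
        JOIN clients c ON c.client_id = h.client_id
        WHERE h.region = 'US' AND h.sector = 'Technology'
    """,
    "What is the asset allocation of customer X?": """
        SELECT asset_class, SUM(market_value) AS total_value
        FROM holdings
        WHERE client_name = 'X'
        GROUP BY asset_class
    """,
}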

[Image: PySpark UDF example]

UDF (User Defined Functions) in PySpark
 

Standard DataFrame operations cover most needs, but UDFs fill the gaps when you need custom business logic that can't be expressed with built-in Spark functions — things like complex string parsing, conditional transformations involving multiple rules, or calling external libraries (e.g., dateutil, re, custom validators).
 

Simple Example: Classifying a Customer's Spend Tier

Imagine you have a sales DataFrame and want to tag each customer with a business-defined tier label based on their total spend — logic that's too specific for a simple when/otherwise chain.
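Here is a minimal sketch of that example. The tier thresholds, column names, and the toy sales data are assumptions; spark is the session a Fabric notebook provides.

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Toy data standing in for a real sales table.
sales_df = spark.createDataFrame(
    [("c1", 120_000.0), ("c2", 15_000.0), ("c3", 800.0)],
    ["customer_id", "amount"],
)

# Hypothetical business rules for the tier label.
def spend_tier(total_spend):
    if total_spend is None:
        return "unknown"
    if total_spend >= 100_000:
        return "platinum"
    if total_spend >= 10_000:
        return "gold"
    return "standard"

spend_tier_udf = F.udf(spend_tier, StringType())

# Aggregate spend per customer, then apply the UDF to tag each customer.
customer_tiers = (
    sales_df.groupBy("customer_id")
    .agg(F.sum("amount").alias("total_spend"))
    .withColumn("tier", spend_tier_udf("total_spend"))
)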

Defining the schema explicitly
The rule of thumb: schema inference is fine for exploration; defined schemas are essential for production pipelines. In Fabric, where we build data pipelines that run repeatedly on large datasets, defining the schema upfront is a best practice worth building into a habit early.
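A minimal sketch of what that looks like in a Fabric notebook. The lakehouse path and column names are made up.

from pyspark.sql.types import StructType, StructField, StringType, DoubleType, DateType

# Explicit schema: no inference pass over the data, and type mismatches surface loudly.
sales_schema = StructType([
    StructField("customer_id", StringType(), nullable=False),
    StructField("order_date", DateType(), nullable=True),
    StructField("amount", DoubleType(), nullable=True),
])

# Hypothetical lakehouse path.
sales_df = spark.read.csv("Files/bronze/sales.csv", header=True, schema=sales_schema)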

[Image: explicit schema definition]

Partitioning in PySpark
Partitioning is one of those topics that becomes very important as our data grows. Spark processes data in parallel chunks called partitions. Partitioning strategy directly affects:

 

  • Query speed — Spark can skip irrelevant partitions entirely (called partition pruning)

  • Shuffle cost — bad partitioning causes expensive data movement across nodes

  • Memory pressure — too few partitions = large chunks per node, too many = overhead
     

Partition by columns you frequently filter on. A good partition column has low cardinality (not too many unique values): region with 4 values is great, while customer_id with millions of values would create millions of tiny files and hurt both write and read performance.
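A minimal write/read sketch of that idea. The region column, the toy data, and the lakehouse paths are assumptions.

# Toy data; in practice this would be a large fact table.
sales_df = spark.createDataFrame(
    [("c1", "EU", 120.0), ("c2", "US", 80.0), ("c3", "EU", 45.0)],
    ["customer_id", "region", "amount"],
)

# Write partitioned by a low-cardinality column.
(sales_df.write
    .mode("overwrite")
    .partitionBy("region")
    .format("delta")
    .save("Tables/gold_sales"))   # hypothetical lakehouse path

# A filter on the partition column lets Spark prune irrelevant partitions.
gold = spark.read.format("delta").load("Tables/gold_sales")
eu_only = gold.filter(gold.region == "EU")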

[Image: partitioning in PySpark]

Sometimes we just need Python and Pandas
PySpark is amazing for large datasets (e.g., 10 GB+). For smaller datasets, however, plain Python is faster and cheaper, and pandas' visualization capabilities are superior to PySpark's.
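A minimal sketch. The path and column names are made up; pandas plots via matplotlib under the hood.

import pandas as pd

# Hypothetical path in the attached lakehouse.
sales = pd.read_parquet("Files/silver/sales.parquet")

# One-liner visualization straight from pandas.
monthly = sales.groupby("month")["amount"].sum()
monthly.plot(kind="bar", title="Monthly sales")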

[Image: pandas example]

Inside the Apache Parquet Format

Apache Parquet is a columnar storage format that offers significant advantages over traditional row-based formats like CSV or JSON. Unlike row-oriented storage, Parquet organizes data by columns, which allows analytical queries to read only the relevant columns without scanning unnecessary data. It also embeds the data schema directly in the file and enables efficient compression of repetitive values — for example, a boolean column with many repeated 0s and 1s can be compressed far more effectively than in a plain text format. Microsoft Fabric makes it easy to work with Parquet files stored in OneLake or other cloud storage solutions. source: aka.ms/fabricnotes
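A tiny sketch of the columnar idea with pandas; file and column names are made up.

import pandas as pd

# Toy example: write a Parquet file, then read back only the columns we need.
df = pd.DataFrame({
    "region": ["EU", "US", "EU", "US"],
    "active": [1, 0, 1, 1],            # repetitive values compress well
    "amount": [10.0, 20.0, 5.0, 7.5],
})
df.to_parquet("sales.parquet")

# Columnar read: only region and amount are scanned, not the whole file.
subset = pd.read_parquet("sales.parquet", columns=["region", "amount"])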

[Image: Parquet format notes, from aka.ms/fabricnotes]

© 2026 by Franco Arda
 
