Category Archives: Azure Databricks
From Raw to Reliable: Cleaning Data at Scale with Azure Databricks
Are you struggling with messy spreadsheets full of duplicates, missing values, and inconsistent records? You're not alone. Data professionals spend nearly 80% of their time cleaning and preparing data before any real analysis begins. The truth is simple: without clean data, business reports are unreliable, AI models fail, and decision-making slows down. In this blog, we'll show you how Azure Databricks makes data cleaning easier, faster, and more scalable, turning raw inputs into reliable insights with just a few lines of code.

Why Clean Data Matters

For business leaders, whether you're a Team Lead, CTO, or CEO, clean data directly impacts growth. With Azure Databricks, you get a cloud-native, Spark-powered platform that handles big data at scale while integrating seamlessly with Azure Data Lake, Synapse, and Power BI.

Practical Example: Cleaning a Sales Dataset in Azure Databricks

Imagine you have a raw CSV file in Azure Data Lake with customer sales data suffering from the issues described above: duplicate rows, missing names and sales values, and inconsistent records. A few lines of PySpark in Databricks resolve them (a hedged sketch is included at the end of this post).

Output after cleaning:

CustomerID | Name    | Country | Sales
101        | Alice   | USA     | 500
102        | Bob     | USA     | 300
103        | Unknown | UK      | 450
104        | David   | India   | 0

With just a few lines of Spark code, the dataset is now ready for reporting, visualization, or machine learning.

To conclude, clean data is the foundation of every reliable business insight. With Azure Databricks, you can automate messy, manual processes and create repeatable, scalable pipelines that keep your data reliable, no matter how fast your business grows.

✅ Start small: try building a simple cleaning pipeline in Azure Databricks today.
✅ Save time: focus more on insights, less on manual data prep.
✅ Scale with confidence: as your data grows, Databricks grows with you.

👉 Want to take the next step? Explore how Databricks integrates with Power BI for real-time dashboards or with MLflow for machine learning pipelines. Stay tuned for our next post where we'll cover these use cases in detail.

✨ With Databricks, your journey from raw to reliable data starts today. Contact us today at Transform@cloudfronts.com to get started.

To learn more about the functionalities of Databricks and other Azure AI services, please refer to my other blogs from the links given below:
1] The Hidden Cost of Bad Data: How Strong Data Management Unlocks Scalable, Accurate AI – CloudFronts
2] Automating Document Vectorization from SharePoint Using Azure Logic Apps and Azure AI Search – CloudFronts
3] Using Open AI and Logic Apps to develop a Copilot agent for Elevator Pitches & Lead Qualification – CloudFronts
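Since the original code is not included in this archive excerpt, here is a minimal PySpark sketch of the kind of cleaning described above. The file path, the column names (CustomerID, Name, Country, Sales, taken from the output table), and the country mapping are assumptions for illustration, not the exact original script.

from pyspark.sql import functions as F

# Read the raw CSV from Azure Data Lake (the path is a placeholder)
raw_df = (
    spark.read
    .option("header", True)
    .option("inferSchema", True)
    .csv("abfss://raw@<storage-account>.dfs.core.windows.net/sales/raw_sales.csv")
)

# 1] Remove duplicate customer rows
clean_df = raw_df.dropDuplicates(["CustomerID"])

# 2] Fill missing values: unknown names, zero sales
clean_df = clean_df.fillna({"Name": "Unknown", "Sales": 0})

# 3] Standardize inconsistent country values (example mapping only)
clean_df = clean_df.withColumn(
    "Country",
    F.when(F.upper(F.col("Country")).isin("US", "USA", "UNITED STATES"), "USA")
     .otherwise(F.col("Country")),
)

# Save as a Delta table so reports and ML pipelines can reuse it
clean_df.write.format("delta").mode("overwrite").saveAsTable("sales_clean")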
Setting Up Unity Catalog in Databricks for Centralized Data Governance
The fastest way to lose control of enterprise data? Managing governance separately across workspaces. Unity Catalog solves this with one centralized layer for security, lineage, and discovery.

Data governance is crucial for any organization looking to manage and secure its data assets effectively. Databricks' Unity Catalog is a centralized solution that provides a unified interface for managing access control, auditing, data lineage, and discovery. This blog will guide you through the process of setting up Unity Catalog in your Databricks workspace.

What is Unity Catalog?

Unity Catalog is Databricks' answer to centralized data governance. It enables organizations to enforce standards-compliant security policies, apply fine-grained access controls, and visualize data lineage across multiple workspaces. It ensures compliance and promotes efficient data management.

Key Features:
1] Standards-Compliant Security: ANSI SQL-based access policies that apply across all workspaces in a region.
2] Fine-Grained Access Control: Support for row- and column-level permissions.
3] Audit Logging: Tracks who accessed what data and when.
4] Data Lineage: Provides visualization of data flow and dependencies.

Unity Catalog Object Hierarchy

Before diving into the setup, it's important to understand the hierarchical structure of Unity Catalog:
1] Catalogs: The top-level container (e.g., Production, Development) that represents an organizational unit or environment.
2] Schemas: Logical groupings of tables, views, and AI models within a catalog.
3] Tables and Views: These include managed tables fully governed by Unity Catalog and external tables referencing existing cloud storage.

Here is the procedure to set up a Unity Catalog metastore backed by Azure Storage, as I did for one of our products (SmartPitch Sales & Marketing Agent):

1] First, create a storage account with the primary service set to "Azure Blob Storage or Azure Data Lake Storage Gen2". Performance and redundancy can be chosen based on the workload the Databricks service will support; for my Mosaic AI Agent, I used Locally Redundant Storage and Data Lake Gen2.

2] Once the storage account is created, ensure that "Hierarchical Namespace" is enabled. When creating a Unity Catalog metastore on Azure storage, Hierarchical Namespace (HNS) is required because Unity Catalog needs:
a] A folder-like structure to organize catalogs, schemas, and tables.
b] Atomic operations (rename, move, delete) on directories and files.
c] POSIX-style access controls for fine-grained permissions.
d] Faster metadata handling for lineage and governance.
HNS turns Azure Blob Storage into ADLS Gen2, which supports these features.

3] Upload any raw/unclean files you will need in Databricks to your metastore folder in the blob storage.

4] Create the access connector (Access Connector for Azure Databricks) in the Azure portal and assign it the "Storage Blob Data Contributor" role.

5] Configure CORS (Cross-Origin Resource Sharing) settings for that storage account. Without configuring CORS, Databricks cannot communicate with your storage container to read/write managed tables, schema metadata, or logs.

6] Generate a SAS token.

7] Navigate to your workspace and select "Manage Account"; this must be done by the account admin.

8] Select the Catalog tab on the left and then click "Create Metastore".

9] Assign a name, a region (the same as the workspace), the path to the storage account, and the connector ID.
10] Once the metastore is created, assign it to a workspace.

11] Once this is done, the catalogs, schemas, and tables within it can be created (a minimal sketch follows at the end of this post).

How does Unity Catalog differ from Hive Metastore?

Feature | Hive Metastore | Unity Catalog
Scope | Workspace- or cluster-specific | Centralized; spans multiple workspaces and regions
Architecture | Single metastore tied to Spark/Hive | Cloud-native service integrated with Databricks
Object Hierarchy | Databases → Tables → Partitions | Catalogs → Schemas → Tables/Views/Models
Data Assets Supported | Tables, views | Tables, views, files, ML models, dashboards
Security | Basic GRANT/DENY at database/table level | Fine-grained, ANSI SQL-based (catalog, schema, table, column, row)
Lineage | Not available | Built-in lineage and impact analysis
Auditing | Limited or external | Integrated audit logs across workspaces
Storage Management | Points to storage locations; no governance | Manages external and managed tables with governance
Cloud Integration | Primarily cluster storage or external paths | Secure integration with ADLS Gen2, S3, GCS
Permissions Model | Spark SQL statements | Attribute- and role-based access, unified policies
Use Cases | Basic metadata store for Spark/Hive workloads | Enterprise-wide data governance, sharing, and compliance

To conclude, Unity Catalog is the next-generation governance and metadata solution for Databricks, designed to give organizations a single, secure, and scalable way to manage data and AI assets. Unlike the older Hive Metastore, it centralizes control across multiple workspaces, supports fine-grained access policies, delivers built-in lineage and auditing, and integrates seamlessly with cloud storage like Azure Data Lake, S3, or GCS.

When setting it up, key steps include:
1] Creating a metastore and linking it to your workspaces.
2] Enabling hierarchical namespace on Azure storage for folder-level security and operations.
3] Configuring CORS to allow Databricks domains to interact with storage.
4] Defining catalogs, schemas, and tables for structured governance.

By implementing Unity Catalog, you ensure stronger security, better compliance, and faster data discovery, making your Databricks environment enterprise-ready for analytics and AI. These are the business outcomes that matter most.

Why now? As data volumes and regulatory requirements grow, organizations can no longer rely on fragmented or legacy governance tools. Unity Catalog offers a future-proof foundation for unified data management and AI governance, essential for any modern data-driven enterprise.

At CloudFronts, we help enterprises implement and optimize Unity Catalog within Databricks to ensure secure, compliant, and scalable governance of enterprise data. Book a consultation with our experts to explore how Unity Catalog can simplify compliance and boost productivity for your teams. Contact us today at Transform@cloudfronts.com to get started.

To learn more about the functionalities of Databricks and other Azure AI services, please refer to my other blogs from the links given below:
1] The Hidden Cost of Bad Data: How Strong Data Management Unlocks Scalable, Accurate AI – CloudFronts
2] Automating Document Vectorization from SharePoint Using Azure Logic Apps and Azure AI Search – CloudFronts
3] Using Open AI and Logic Apps to develop a Copilot agent for Elevator Pitches & Lead Qualification – CloudFronts
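As promised in step 11, here is a minimal sketch of creating a catalog, schema, and managed table once the metastore is assigned to a workspace, run from a Databricks notebook. The catalog, schema, table, and group names are assumptions for illustration, not the actual SmartPitch objects.

# Create a catalog and a schema governed by Unity Catalog
spark.sql("CREATE CATALOG IF NOT EXISTS smartpitch")
spark.sql("CREATE SCHEMA IF NOT EXISTS smartpitch.sales")

# Create a managed table inside the schema
spark.sql("""
    CREATE TABLE IF NOT EXISTS smartpitch.sales.leads (
        lead_id INT,
        name STRING,
        country STRING
    )
""")

# Grant read access to an account-level group (the group name is hypothetical)
spark.sql("GRANT SELECT ON TABLE smartpitch.sales.leads TO `data_analysts`")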
Building the AI Bridge: How CloudFronts Helps You Connect Systems That Talk to Each Other
When we say we are building a bridge, does that mean something isn't connected? It does: it's AI itself and your systems that are not connected. Even though your AI can reach into your systems to derive information, without that bridge it remains unreliable and slow.

What is needed for AI to be successful?

To eliminate these problems, we need a 'catalog' layer that houses all business data together so that a common vocabulary is established between systems. AI then pulls from this data catalog to perform agentic actions. The diagram below explains, at a high level, how this looks. And all of this is defined by how well the integrations between these systems are established.

How CloudFronts Can Help

CloudFronts has deep integration expertise connecting cloud-based applications with each other. Often, ready-made plug-and-play cloud integration solutions come with hefty licensing that keeps going up every few years. Using such integration tools not only affects cash flow but also adds a layer of opaqueness, as we don't control the flow of integration and cannot granularize it beyond what's offered. Custom integration gives you better control and analytics, which ready-made solutions can't. Here's a CloudFronts case study published by Microsoft, wherein we connected multiple systems for our customer, driving data and insights.

To conclude, AI agents aren't optimized to work for your organization right away. This disconnect needs to be engineered just like any other implementation project today. The gap is real and is bridged by a unified catalog (such as Unity Catalog) and well-built integrations; CloudFronts can help close it and make AI work for your organization, so you can continue to optimize cash flow against rising costs.

We hope you found this blog useful, and if you would like to discuss anything, you can reach out to us at transform@cloudfronts.com.
Create Powerful No-Code AI Agents – Azure AI Foundry
An AI agent is a smart program that can think, make decisions, and do tasks. Sometimes it works alone, and sometimes it works with people or other agents. The main difference between an agent and a regular assistant is that agents can do things on their own. They don't just help: you can give them a goal, and they'll try to reach it.

Every AI agent has three main parts. Agents can take input like a message or a prompt and respond with answers or actions. For example, they might look something up or start a process based on what you asked. Azure AI Foundry is a platform that brings all these pieces together so you can build, train, and manage AI agents easily.

References
1] What is Azure AI Foundry Agent Service? – Azure AI Foundry | Microsoft Learn
2] Understanding deployment types in Azure AI Foundry Models – Azure AI Foundry | Microsoft Learn
3] https://learn.microsoft.com/en-us/azure/ai-foundry/how-to/index-add

Usage

First, we create a project in Azure AI Foundry. Click Next, give your project a name, and wait until the setup finishes. Once project creation finishes, we are greeted with the project home screen. Click the Agents tab and then click Next to choose the model. I'm currently using GPT-4o mini; the picker also includes descriptions for all the available models.

Then we configure the deployment details. There are multiple deployment types available, such as Global deployments and Data Zone Standard deployments.

Standard deployments [Standard] follow a pay-per-use model, perfect for getting started quickly. They're best for low to medium usage with occasional traffic spikes; for high and steady loads, however, performance may vary. Provisioned deployments [ProvisionedManaged] let you pre-allocate the amount of processing power you need. This is measured in Provisioned Throughput Units (PTUs); each model and version requires a different number of PTUs and offers different performance levels. Provisioned deployments ensure predictable and stable performance for large or mission-critical workloads.

This is how the deployment details look for Global Standard. I'll be choosing the Standard deployment for our use case. Click Deploy and wait for a few seconds.

Once the deployment is completed, you can give your agent a name and some instructions for its behavior. You should also specify the tone, end goal, verbosity, and so on. You can also set the Temperature and Top P values, which both control the randomness or creativity of the model.

Temperature controls how bold or cautious the model is. A lower temperature gives safer, more predictable answers (factual Q&A, code summarization); a higher temperature gives more creative or surprising answers (poetry, creative writing).

Top P (nucleus sampling) controls how wide the model's word choices are. A lower Top P only picks from the most likely words (legal or financial writing); a higher Top P includes less likely, more diverse words (brainstorming names). A toy sketch after this section illustrates how these two knobs interact.

Next, I'll add a knowledge base to my bot. For this example, I'll just upload a single file. However, you also have the option to add a SharePoint folder or files, or connect it to Bing Search, Microsoft Fabric, Azure AI Search, and so on, as required.
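To make Temperature and Top P more concrete, here is a small, self-contained Python sketch (not tied to Azure AI Foundry) showing how temperature reshapes a toy probability distribution over next words and how Top P then trims it to a nucleus of likely candidates. The word list and scores are invented for illustration.

import math

# Toy next-word scores from a language model (values are invented for illustration)
logits = {"reliable": 2.0, "fast": 1.5, "scalable": 1.0, "purple": -1.0}

def softmax_with_temperature(scores, temperature):
    # Lower temperature sharpens the distribution; higher temperature flattens it
    scaled = {w: s / temperature for w, s in scores.items()}
    total = sum(math.exp(v) for v in scaled.values())
    return {w: math.exp(v) / total for w, v in scaled.items()}

def top_p_filter(probs, top_p):
    # Keep the smallest set of words whose cumulative probability reaches top_p
    kept, cumulative = {}, 0.0
    for word, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[word] = p
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalize so the kept words form a valid distribution
    total = sum(kept.values())
    return {w: p / total for w, p in kept.items()}

for temperature in (0.2, 1.0):
    probs = softmax_with_temperature(logits, temperature)
    nucleus = top_p_filter(probs, top_p=0.9)
    print(f"temperature={temperature}: candidates={list(nucleus)}")

With temperature 0.2 the nucleus collapses to the single safest word, while at 1.0 several plausible words survive, which is exactly the "cautious vs. creative" trade-off described above.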
A vector store in Azure AI Foundry helps your AI agent retrieve relevant information based on meaning rather than just keywords. It works by breaking your content (like a PDF) into smaller parts, converting them into numerical representations (embeddings), and storing them. When a user asks a question, the AI finds the most semantically similar parts from the vector store and uses them to generate accurate, context-aware responses. A small sketch of this idea follows at the end of this post.

Once you select the file, click Upload and Save. At this point, you can start to interact with your model: click the "Try in Playground" button. And here, we can see the output based on our provided knowledge base. One more example, just because it is kind of fun.

Every input that you provide to the agent is called a "message". Every time the agent is invoked to process the provided input, that invocation is called a "run". Every interaction session with the agent is called a "thread". We can see all the open threads in the Threads section.

To conclude, Azure AI Foundry makes it easy to build and use AI agents without writing any code. You can choose models, set how they behave, and connect your data, all through a simple interface. Whether you're testing ideas, automating tasks, or building custom bots, Foundry gives you the tools to do it. If you're curious about AI or want to try building your own agent, Foundry is a great place to begin.

We hope you found this blog useful, and if you would like to discuss anything, you can reach out to us at transform@cloudfronts.com.
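To illustrate what "retrieving by meaning" looks like under the hood, here is a tiny Python sketch of cosine similarity over pre-computed embeddings. The vectors and document chunks are invented for illustration; a real vector store would generate embeddings with a model and index them far more efficiently.

import math

# Pretend embeddings for a few document chunks (vectors are invented for illustration)
chunks = {
    "Databricks cleans raw sales data": [0.9, 0.1, 0.2],
    "Unity Catalog governs access":     [0.2, 0.8, 0.1],
    "Office cafeteria menu":            [0.1, 0.1, 0.9],
}

def cosine_similarity(a, b):
    # Angle-based similarity: 1.0 means same direction (same meaning in this toy model)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# A pretend embedding of the user question "How do I clean sales data?"
question_embedding = [0.85, 0.15, 0.1]

# Rank chunks by semantic similarity and keep the best match for the answer
ranked = sorted(chunks.items(),
                key=lambda kv: cosine_similarity(question_embedding, kv[1]),
                reverse=True)
print("Most relevant chunk:", ranked[0][0])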
Struggling with Siloed Systems? Here’s How CloudFronts Gets You Connected
In today's world, we use many different applications for our daily work. One single application can't handle everything, because some apps are designed for specific tasks. That's why organizations use multiple applications, which often leads to data being stored separately or in isolation. In this blog, we'll take you on a journey from siloed systems to connected systems through a customer success story.

About BÜCHI

Büchi Labortechnik AG is a Swiss company renowned for providing laboratory and industrial solutions for R&D, quality control, and production. Founded in 1939, Büchi specializes in technologies for sample preparation, formulation, and analysis, and its equipment is widely used in pharmaceuticals, chemicals, food & beverage, and academia. Büchi is known for its precision, innovation, and strong customer support worldwide.

Systems Used by BÜCHI

To streamline operations and ensure seamless collaboration, BÜCHI leverages a variety of enterprise systems. Infor and SAP Business One are utilized for managing critical business functions such as finance, supply chain, manufacturing, and inventory.

Reporting Challenges Due to Siloed Systems

Organizations often rely on multiple disconnected systems across departments, such as ERP, CRM, marketing platforms, spreadsheets, and legacy tools. These siloed systems result in fragmented and inconsistent reporting.

The Need for a Single Source of Truth

To solve these challenges, it's critical to establish a Single Source of Truth (SSOT): a central, trusted data platform where all key business data is consolidated, consistent, and up to date.

How We Helped Büchi Connect Their Systems

To build a seamless and scalable integration framework, we leveraged the following Azure services:
>Azure Logic Apps – Enabled no-code/low-code automation for integrating applications quickly and efficiently.
>Azure Functions – Provided serverless computing for lightweight data transformations and custom logic execution.
>Azure Service Bus – Ensured reliable, asynchronous communication between systems with FIFO message processing and decoupling of sender/receiver availability.
>Azure API Management (APIM) – Secured and simplified access to backend services by exposing only required APIs, enforcing policies like authentication and rate limiting, and unifying multiple APIs under a single endpoint.

BÜCHI's case study was published on the Microsoft website, highlighting how CloudFronts helped connect their systems and prepare their data for insights and AI-driven solutions.

Why a Single Source of Truth (SSOT) Is Important

A Single Source of Truth means having one trusted location where your business stores consistent, accurate, and up-to-date data.

How We Did This

We used Azure Function Apps, Service Bus, and Logic Apps to seamlessly connect the systems (a minimal sketch of this pattern appears at the end of this post). Databricks was implemented to build a Unity Catalog, establishing a Single Source of Truth (SSOT). On top of this unified data layer, we enabled advanced analytics and reporting using Power BI.

In May, we hosted an event with BÜCHI at the Microsoft office in Zurich. During the session, one of the attending customers remarked, "We are five years behind BÜCHI." Another added, "If we don't start now, we'll be out of the race in the future." This clearly reflects the urgent need for businesses to evolve. Today, connected systems, a Single Source of Truth (SSOT), advanced analytics, and AI are not optional; they are essential for sustainable growth and improved human efficiency.
The pace of transformation has accelerated: tasks that once took months can now be achieved in days, and soon, perhaps, with just a prompt. To conclude, if you're operating with multiple disconnected systems and relying heavily on manual processes, it's time to rethink your approach. System integration and automation free your teams from repetitive work and empower them to focus on high-impact, strategic activities. We hope you found this blog useful, and if you would like to discuss anything, you can reach out to us at transform@cloudfronts.com.
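As referenced above, here is a minimal sketch of one building block in such an integration: an Azure Function (Python v2 programming model) triggered by an Azure Service Bus queue that reshapes a message before handing it to a downstream system. The queue name, connection setting, and payload fields are assumptions for illustration, not BÜCHI's actual implementation.

import json
import logging
import azure.functions as func

app = func.FunctionApp()

@app.service_bus_queue_trigger(arg_name="msg",
                               queue_name="orders-in",             # hypothetical queue
                               connection="ServiceBusConnection")  # app setting name
def transform_order(msg: func.ServiceBusMessage):
    # Decode the incoming message from the source system
    order = json.loads(msg.get_body().decode("utf-8"))

    # Lightweight transformation: map source fields to the target system's shape
    target_payload = {
        "orderId": order.get("OrderNumber"),
        "customer": order.get("CustName"),
        "amount": order.get("TotalAmount", 0),
    }

    # In a real pipeline this payload would be forwarded (for example via Logic Apps
    # or an API behind APIM); here we just log it to keep the sketch self-contained.
    logging.info("Transformed order: %s", json.dumps(target_payload))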
Azure Databricks – How to read a CSV file from blob storage and push the data into a Synapse SQL pool table
In this blog, we will learn how to read a CSV file from blob storage and push the data into a Synapse SQL pool table using an Azure Databricks Python script. In Part 1 we created an Azure Synapse Analytics workspace and saw how to create a dedicated SQL pool. In this blog, we will use a JDBC connection string to connect to that SQL pool.

Step 1: Sign in to the Azure portal. Open Azure Databricks and click on Launch Workspace to create a new notebook.

Step 2: Once the Azure Databricks workspace opens, click on New Notebook and select your language; here I have selected Python.

Step 3: Add the following code to connect to your dedicated SQL pool using the JDBC connection string and push the data into a table.

Python script:

# Storage account credentials (replace the placeholders with your own values)
storage_account_name = 'Your Storage account name'
storage_account_access_key = 'Your Storage account access key'

# Make the storage account key available to Spark
spark.conf.set('fs.azure.account.key.' + storage_account_name + '.blob.core.windows.net',
               storage_account_access_key)

# Build the wasbs:// path to the CSV file in the blob container
blob_container = 'Your container name'
filePath = "wasbs://" + blob_container + "@" + storage_account_name + \
           ".blob.core.windows.net/Your CSV file name"

# Read the CSV file into a Spark DataFrame
empDf = spark.read.format("csv").load(filePath, inferSchema=True, header=True)

# JDBC connection string of the dedicated SQL pool (see Step 4)
connectionString = "Your JDBC connection string;encrypt=true;trustServerCertificate=false;rewriteBatchedStatements=true;loginTimeout=30;"

# Append the DataFrame rows to the target table
empDf.write.jdbc(connectionString, "[dbo].[Employee]", mode="append")

Step 4: To get the JDBC connection string, open the Synapse workspace, and in the left pane under Analytics pools open SQL pools. Select your SQL pool; on the Overview page you will find the link "Show database connection strings". Click the JDBC tab and copy the connection string.

Step 5: Now click on the Run All button to execute the notebook. The script will execute successfully; you can also verify the rows in the SQL table.

Hope this will help.
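One more tip before closing: for dedicated SQL pools, Databricks also provides an Azure Synapse connector that stages data through a storage account, which generally scales better than plain JDBC for large loads. Below is a hedged sketch of that approach; the server URL, temp directory, and table name are placeholders, and the available options may vary by Databricks runtime.

# Write the same DataFrame through the Azure Synapse connector (placeholder values)
(empDf.write
     .format("com.databricks.spark.sqldw")
     .option("url", "jdbc:sqlserver://<server>.sql.azuresynapse.net:1433;database=<pool>;user=<user>;password=<password>")
     .option("forwardSparkAzureStorageCredentials", "true")   # reuse the Spark storage key set above
     .option("dbTable", "dbo.Employee")
     .option("tempDir", "wasbs://<container>@<storage-account>.blob.core.windows.net/tempdir")
     .mode("append")
     .save())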
Azure Databricks – Part 1 – How to create an Azure Databricks workspace and a Spark cluster?
In this blog, we will learn how to create an Azure Databricks workspace and a Spark cluster step by step using the Azure portal.

Create an Azure Databricks workspace:

Step 1: Sign in to the Azure portal. In the upper-left corner of the home page, select Create a resource. In the Search the Marketplace box, enter "Azure Databricks", select it, and press Enter.

Step 2: Select Azure Databricks from the search results and click on the Create button.

Step 3: Enter the following information: Subscription, Resource group, Workspace name, Region, and Pricing tier.

Step 4: Check the Review + create tab, then click on the Create button. Once you click Create, it will take 3 to 4 minutes to create the resource.

Create a Spark cluster in Azure Databricks:

Step 1: In the Azure portal, go to the Databricks workspace that you created, and then click Launch Workspace.

Step 2: You are redirected to the Azure Databricks portal. From the portal, click New Cluster.

Step 3: On the New Cluster page, provide the values to create the cluster (a REST API alternative is sketched below).

Hope this will help.
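If you prefer automating the same cluster creation instead of clicking through the UI, the Databricks Clusters REST API can do it from a short Python script. This is a hedged sketch: the workspace URL, access token, node type, and runtime version are placeholders, and you should confirm the fields against the Clusters API documentation for your workspace.

import requests

# Placeholders: your workspace URL and a personal access token
workspace_url = "https://<your-workspace>.azuredatabricks.net"
token = "<personal-access-token>"

cluster_spec = {
    "cluster_name": "demo-cluster",
    "spark_version": "<runtime-version>",   # a supported Databricks runtime version
    "node_type_id": "<node-type>",          # a VM size available in your region
    "num_workers": 1,
}

# Call the Clusters API to create the cluster
response = requests.post(
    f"{workspace_url}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {token}"},
    json=cluster_spec,
)
response.raise_for_status()
print("Created cluster:", response.json().get("cluster_id"))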
Azure Databricks – Part 2 – How to read Amazon DynamoDB table data using Notebooks
In this blog, we will learn how to connect to AWS DynamoDB and read table data using a Python script, step by step.

Step 1: In the left pane, select Azure Databricks. From the Common Tasks, select New Notebook.

Step 2: In the Create Notebook dialog box, enter a name, select Python as the language, and select the Spark cluster that you created earlier.

Step 3: Once the notebook is created, you can write a Python script to connect to AWS DynamoDB using the boto3 client library. To connect to AWS DynamoDB you must have an AWS access key ID and an AWS secret access key.

Python script:

# Databricks notebook source
import boto3
import pandas as pd

# Create a boto3 session with your AWS credentials and region
session = boto3.session.Session(
    aws_access_key_id='your AWS access key ID',
    aws_secret_access_key='your AWS secret access key',
    region_name='your region'
)

# Connect to DynamoDB and read the items from the table
dynamodb = session.resource("dynamodb")
table = dynamodb.Table("Table Name")
response = table.scan()
items = response["Items"]

# Convert the items to a pandas DataFrame and export as CSV text
data = pd.DataFrame(items)
output = data.to_csv(index_label="idx", encoding="utf-8")
print(output)

Step 4: Now you can check the output by pressing Shift + Enter or by clicking Run Cell.

Hope this will help.
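One caveat worth noting: a DynamoDB scan returns at most 1 MB of data per call, so larger tables require following LastEvaluatedKey. The sketch below extends the example above to paginate through the full table and load the result into a Spark DataFrame for further processing in Databricks; it reuses the table object from the script above.

# Paginate through the full table by following LastEvaluatedKey
items = []
response = table.scan()
items.extend(response["Items"])
while "LastEvaluatedKey" in response:
    response = table.scan(ExclusiveStartKey=response["LastEvaluatedKey"])
    items.extend(response["Items"])

# DynamoDB numbers arrive as Python Decimal objects; casting to string keeps this
# sketch robust, and you can cast columns back to proper types downstream as needed
pdf = pd.DataFrame(items).astype(str)
spark_df = spark.createDataFrame(pdf)
spark_df.show()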