Setting Up Unity Catalog in Databricks for Centralized Data Governance
The fastest way to lose control of enterprise data? Managing governance separately across workspaces. Unity Catalog solves this with one centralized layer for security, lineage, and discovery. Data governance is crucial for any organization looking to manage and secure its data assets effectively. Databricks’ Unity Catalog is a centralized solution that provides a unified interface for managing access control, auditing, data lineage, and discovery. This blog will guide you through the process of setting up Unity Catalog in your Databricks workspace. What is Unity Catalog? Unity Catalog is Databricks’ answer to centralized data governance. It enables organizations to enforce standards-compliant security policies, apply fine-grained access controls, and visualize data lineage across multiple workspaces. It ensures compliance and promotes efficient data management. Key Features: 1] Standards-Compliant Security: ANSI SQL-based access policies that apply across all workspaces in a region. 2] Fine-Grained Access Control: Support for row- and column-level permissions. 3] Audit Logging: Tracks who accessed what data and when. 4] Data Lineage: Provides visualization of data flow and dependencies. Unity Catalog Object Hierarchy Before diving into the setup, it’s important to understand the hierarchical structure of Unity Catalog: 1] Catalogs: The top-level container (e.g., Production, Development) that represents an organizational unit or environment. 2] Schemas: Logical groupings of tables, views, and AI models within a catalog. 3] Tables and Views: These include managed tables fully governed by Unity Catalog and external tables referencing existing cloud storage. Here is the procedure to setup a Unity Catalog Metastore in association with Azure Storage, as I have done for one of our products (SmartPitch Sales & Marketing Agent) – 1] First create a storage account with primary service being – “Azure Blob Storage or Azure Data Lake Storage Gen 2”; Performance and Redundancy can be chosen based on the requirement for which the DataBricks service is being used.Here for my Mosaic AI Agent, I have used Locally Redundant Storage & Data Lake Gen 2 2] Once the storage account is created, ensure that you have enabled “Hierarchical Namespace” When creating a Unity Catalog metastore with Azure Blob Storage, Hierarchical Namespace (HNS) is required because Unity Catalog needs: a] Folder-like structure to organize catalogs, schemas, and tables. b] Atomic operations (rename, move, delete) on directories and files. c] POSIX-style access controls for fine-grained permissions. d] Faster metadata handling for lineage and governance. HNS turns Azure Blob into ADLS Gen2, which supports these features. 3] Upload any Raw/Unclean files to your metastore folder in the blob storage, which would be required for your use in DataBricks. 4] Create a Unity Catalog Connector in Azure Portal and assign it “Storage Blob Data Contributor” Role . 5] Assign CORS (Cross-Origin Resource Sharing) settings for that storage account. Why this is necessary: In short: Without configuring CORS, Databricks cannot communicate with your storage container to read/write managed tables, schema metadata, or logs. 6] Generate SAS Token 7] Navigate to your Workspace and select “Manage Account” – this should be done from the account admin. 8] Select Catalog tab on the left and then click “Create Metastore” 9] Assign a Name, Region (Same as Workspace), The path to the storage account, and the connector id. 10] Once the Metastore is created, assign it to a workspace . 11] Once this is done, the catalogs and the schemas, and tables in within it can be created. How does Unity Catalog differ from Hive Metastore ? Feature Hive Metastore Unity Catalog Scope Workspace or cluster-specific Centralized, spans multiple workspaces and regions Architecture Single metastore tied to Spark/Hive Cloud-native service integrated with Databricks Object Hierarchy Databases → Tables → Partitions Catalogs → Schemas → Tables/Views/Models Data Assets Supported Tables, views Tables, views, files, ML models, dashboards Security Basic GRANT/DENY at database/table level Fine-grained, ANSI SQL–based (catalog, schema, table, column, row) Lineage Not available Built-in lineage and impact analysis Auditing Limited or external Integrated audit logs across workspaces Storage Management Points to storage locations; no governance Manages external and managed tables with governance Cloud Integration Primarily on cluster storage or external path Secure integration with ADLS Gen2, S3, GCS Permissions Model Spark SQL statements Attribute- and role-based access, unified policies Use Cases Basic metadata store for Spark/Hive workloads Enterprise-wide data governance, sharing, and compliance To conclude, Unity Catalog is the next-generation governance and metadata solution for Databricks, designed to give organizations a single, secure, and scalable way to manage data and AI assets. Unlike the older Hive Metastore, it centralizes control across multiple workspaces, supports fine-grained access policies, delivers built-in lineage and auditing, and integrates seamlessly with cloud storage like Azure Data Lake, S3, or GCS. When setting it up, key steps include: 1] Creating a metastore and linking it to your workspaces. 2] Enabling hierarchical namespace on Azure storage for folder-level security and operations. 3] Configuring CORS to allow Databricks domains to interact with storage. 4] Defining catalogs, schemas, and tables for structured governance. By implementing Unity Catalog, you ensure stronger security, better compliance, and faster data discovery, making your Databricks environment enterprise-ready for analytics and AI. Business Outcomes of Unity Catalog By implementing Unity Catalog, organizations can achieve: Why now? As data volumes and regulatory requirements grow, organizations can no longer rely on fragmented or legacy governance tools. Unity Catalog offers a future-proof foundation for unified data management and AI governance—essential for any modern data-driven enterprise. At CloudFronts, we help enterprises implement and optimize Unity Catalog within Databricks to ensure secure, compliant, and scalable data governance for enterprise data governance.Book a consultation with our experts to explore how Unity Catalog can simplify compliance and boost productivity for your teams.Contact us today at Transform@cloudfronts.com to get started. To learn more about functionalities of DataBricks and other Azure AI services, please refer to my other blogs from the links given below: – 1] The Hidden Cost of Bad Data:How Strong Data Management Unlocks Scalable, Accurate AI – CloudFronts 2] Automating Document Vectorization from SharePoint Using Azure Logic Apps and Azure AI Search – CloudFronts 3] Using Open AI and Logic Apps to develop a Copilot agent for … Continue reading Setting Up Unity Catalog in Databricks for Centralized Data Governance
Share Story :
Automating Document Vectorization from SharePoint Using Azure Logic Apps and Azure AI Search
In modern enterprises, documents stored across platforms like SharePoint often remain underutilized due to the lack of intelligent search capabilities. What if your organization could automatically extract meaning from those documents—turning them into searchable vectors for advanced retrieval systems? That’s exactly what we’ve achieved by integrating Azure Logic Apps with Azure AI Search. Workflow Overview Whenever a user uploads a file to a designated SharePoint folder, a scheduled Azure Logic App is triggered to: Once stored, a scheduled Azure Cognitive Search Indexer kicks in. This indexer: Technologies / resources used: –-> SharePoint: A common document repository for enterprise users, ideal for collaborative uploads. -> Azure Logic Apps: Provides low-code automation to monitor SharePoint for changes and sync files to Blob Storage. It ensures a reliable, scheduled trigger mechanism with minimal overhead. -> Blob Storage: Serves as the staging ground where documents are centrally stored for indexing—cheaper and more scalable than relying solely on SharePoint connectors. -> Azure AI Search (Cognitive Search): The intelligence layer that runs a skillset pipeline to extract, transform, and vectorize the content, enabling semantic search, multimodal RAG (Retrieval Augmented Generation), and other AI-enhanced scenarios. Why Not Vectorize Directly from SharePoint? Reference:-1. https://learn.microsoft.com/en-us/azure/search/search-howto-index-sharepoint-online2. https://learn.microsoft.com/en-us/azure/search/search-howto-indexing-azure-blob-storage How to achieve this? Stage 1: – Logic App to sync Sharepoint files to blob Firstly, create a designated Sharepoint directory to upload the required documents for vectorization. Then create the logic app to replicate the files along with it’s format and properties to the associated blob storage – 1] Assign the site address and the directory name where the documents are uploaded in Sharepoint – In the trigger action “When an item is created or modified”. 2] Assign a recurrence frequency, start time and time zone to check/verify for new documents and keep the blob container updated. 3] Add an action component – “Get file content using path”; and dynamically provide the full path (includes file extension), from the trigger 4] Finally, add an action to create blobs in the designated container that would be vectorized – provide the storage acc. name, directory path, the name of blob (Select to dynamically get the file name with extension for the trigger), blob content (from the get file content action). 5] On successfully saving & running this logic app, either manually or on trigger, the files are replicated in it’s exact form to the blob storage. Stage 2 :- Azure AI Search resource to vectorize the files in blob storage In Azure Portal (Home – Microsoft Azure), search for Azure AI Search service, and provide the necessary details, based on your requirement select a pricing tier. Once resource is successfully created, select “Import & vectorize data” From the 2 options – RAG and Multimodal RAG Index, select the latter one.RAG combines a retriever (to fetch relevant documents) with a generative language model (to generate answers) using text-only data. Multimodal RAG extends the RAG architecture to include multiple data types such as text, images, tables, PDFs, diagrams, audio, or video. Workflow: Now follow the steps and provide the necessary details for the index creation Enable deletion tracking, to remove the records of deleted documents from the index Provide a document intelligence resource to enable OCR, and to get location metadata for multiple document types. Select image verbalization (to verbalize text in images) or multimodal embedding to vectorize the whole image. Assign the LLM model for generating the embeddings for the text/images provide an image output location, to store images extracted from the files Assign a schedule to refresh the indexer and to keep the search index up to date with new documents. Once successfully created, search keywords in the search explorer of the index, to verify the vectorization, the results are provided based on it’s relevance and score/distance, to the user’s search query. Let us test this index in Custom Copilot Agent , by importing this index as an azure ai search knowledge source. On fetching details of certain document specific information, the index is searched for the most appropriate information, and the result is rendered in readable format by generative AI. We hope you found this blog useful, and if you would like to discuss anything, you can reach out to us at transform@cloudfronts.com.
Share Story :
The Hidden Cost of Bad Data:How Strong Data Management Unlocks Scalable, Accurate AI
. Key Takeaways 1. Bad data kills ML performance — duplicate, inconsistent, missing, or outdated data can break production models even if training accuracy is high.2. Databricks Medallion Architecture (Raw → Silver → Gold) is essential for turning raw chaos into structured, trustworthy datasets ready for ML & BI.3. Raw layer: capture all data as-is; Silver layer: clean, normalize, and standardize; Gold layer: curate, enrich, and prepare for modeling or reporting.4. Structured pipelines reduce compute cost, improve model reliability, and enable proactive monitoring & feature freshness.5. Data quality is as important as algorithms — invest in transformation and governance for scalable, accurate AI. While developing a Machine Learning Agent for Smart Pitch (Sales/Presales/Marketing Assistant chatbot) within the Databricks ecosystem, I’ve seen firsthand how unclean, inconsistent, and incomplete data can cripple even the most promising machine learning initiatives. While much attention is given to models and algorithms, what often goes unnoticed is the silent productivity killer lurking in your pipelines: bad data. In the chase to build intelligent applications, developers often overlook the foundational layer—data quality. Here’s the truth: It is difficult to scale machine learning on top of chaos. That’s where the Raw → Silver → Gold data transformation framework in Databricks becomes not just useful, but also essential. The Hidden Cost of Bad Data Imagine you’ve built a high-performance ML model. It’s accurate during training, but underperforms in production. Why?Here’s what I frequently detect when I scan input pipelines:– Duplicate records– Inconsistent data types– Missing values– Outdated information– Schema drift (Schema drift introduces malformed, inconsistent, or incomplete data that breaks validation rules and compromises downstream processes.)– Noise in DataThese issues inflate compute costs, introduce bias, and produce unstable predictions, resulting in wasted hours debugging pipelines, increased operational risks, and eroded trust in your AI outputs. Ref.: Data Poisoning: A Silent but Deadly Threat to AI and ML Systems | by Anya Kondamani | nFactor Technologies | Medium As you can see in this example – Despite successfully fetching data, it was unable to render it, due to hitting maximum request limit, as bad data was involved, thus AI Agents face issues if directly brute forced with Raw/Unclean data. But this isn’t just a data science problem. It’s a data engineering problem. And the solution lies in structured, governed data management—beginning with a robust medallion architecture. Ref.: Data Intelligence End-to-End with Azure Databricks and Microsoft Fabric | Microsoft Community Hub Raw → Silver → Gold: The Databricks Way Databricks ML Agents thrive when your data is managed through the medallion architecture, transforming raw chaos into clean, trustworthy features. For those new to Medallion architecture in Databricks, it is a structured data processing framework that progressively improves data quality through bronze (raw), silver (validated), and gold (enriched) layers, enabling scalable and reliable analytics. Raw Layer: Ingest Everything The raw layer is where we land all data, regardless of quality. It’s your unfiltered feed—logs, events, customer input, third-party APIs, CSV dumps, IoT signals, etc. e.g. CloudFronts Case studies as directly fetched from WordPress API This script automates the extraction, transformation, and optional upload of WordPress-based case study content into Azure Blob Storage as well as locally to dbfs of Databricks, making it ready for further processing (e.g., vector databases, AI ingestion, or analytics). What We’ve Done On running the above Raw to Silver Cleaning code in Databricks notebook we get a much formatted, relevant and cleaned fields specific to our requirement In Databricks Unity Catalog → Schema → Tables, it looks something like this: – Silver Layer: Structure and Clean Once ingested, we promote data to the silver layer, where transformation begins. Think of this layer as the data refinery. What happens here: Now we start to analyze trends, infer schemas, and prepare for more active feature generation. Once I have removed noise; I begin to see patterns. This script is part of a data pipeline that transforms unstructured or semi-clean data from a Raw Delta Table (knowledge_base.raw.case_studies) (Actually Silver, as we have already done much of the cleaning part in previous step) into a Gold Delta Table (knowledge_base.gold.case_studies) using Apache Spark. The transformation focuses on HTML cleaning, type parsing, and schema standardization, enabling downstream ML or BI workflow What have we done here ? Start Spark Session Initializes a Spark job using SparkSession, enabling distributed data processing within the Databricks environment. Define UDFs (User-Defined Functions) Read Data from Silver Table Loads structured data from the Delta Table: knowledge_base.raw.case_studies, which acts as the Silver Layer in the medallion architecture. Clean & Normalize Key Columns Write to Gold Table Saves the final transformed DataFrame to the Gold Layer as a Delta Table: knowledge_base.gold.case_studies, using overwrite mode and allowing schema updates. As you can see we have maintained separate schemas for Raw, Silver Gold etc; and at gold we finally managed to get a fully cleaned noiseless data suitable for our requirement. Gold Layer: Ready for ML & BI At the gold layer, data becomes a polished product, curated for specific use cases—ML models, dashboards, reports, or APIs. This is the layer where scale and accuracy become feasible, because it finally has clean, enriched, semantically meaningful data to learn from. Ref.: What is a Medallion Architecture? We can further make it more efficient to fetch with vectorization Once we reach stage and test again in the Agent Playground, we no longer face the error we had seen previously as now it is easier for the agent to retrieve gold standard vectorized data. Why This Matters for AI at Scale The difference between “good enough” and “state of the art” often hinges on data readiness. Here’s how strong data management impacts real-world outcomes: Without Medallion Architecture With Raw → Silver → Gold Data drift goes undetected Proactive schema monitoring Models degrade silently Continuous feature freshness Expensive debugging cycles Clean lineage via Delta Lake Inconsistent outputs Predictable, testable results (Delta Lake is an open-source storage layer that brings ACID transactions, scalable metadata handling, and unified batch processing to data lakes, enabling reliable analytics on massive datasets.) … Continue reading The Hidden Cost of Bad Data:How Strong Data Management Unlocks Scalable, Accurate AI
Share Story :
Using Open AI and Logic Apps to develop a Copilot agent for Elevator Pitches & Lead Qualification
In today’s competitive landscape, the ability to prepare quickly and deliver relevant, high-impact sales conversations is more critical than ever. Sales teams often spend valuable time gathering case studies, reviewing past opportunities, and preparing client-specific messaging — time that could be better spent engaging prospects. To address this, we developed “Smart Pitch” — a Microsoft Teams-integrated AI Copilot designed to equip our sales professionals with instant, contextual access to case studies, opportunity data, and procedural documentation. Challenge Sales professionals routinely face challenges such as: These hurdles not only slow down the sales cycle but also affect the consistency and quality of conversations with prospects. How It Works Platform Data Sources CloudFronts SmartPitch pulls information from the following knowledge sources: AI Integration Key Features MQL – SQL Summary Generator Users can request MQL – SQL document which contains The copilot prompts the user to provide the prospect name, contact person name, and client requirement. This is achieved via an adaptive card for better UX. HTTP Request to Logic App At Logic App we used ChatGPT API to fetch company and client information Extract the company location from the company information, and similarly, extract the industry as well. Render it to custom copilot via request to the Logic App. Use Generative answers node to display the results as required with proper formatting via prompt/Agent Instructions. Generative AI can also be instructed to directly create a formatted json based on parsed values. This formatted Json can be passed to converted to an actual Json and is used to populate a liquid template for the MQL-SQL file to dynamically create MQL-SQL for every searched company and contact person. This returns an HTML File with dynamically populated company and contact details as well as similar case studies, and work with client in similar region and industry. This triggers an auto download of the MQL-SQL created as a PDF file on your system. Content Search Users can ask questions related to – Users can ask questions like “Smart Pitch” searches SharePoint documents, public case studies, and the opportunity table to return relevant results — structured and easy to consume. –Security & Governance Integrated in Microsoft Teams, so the same authentication as Teams. Access to Dataverse and SharePoint is read-only and scoped to organizational permissions. To conclude, Smart Pitch reflects our commitment to leveraging AI to drive business outcomes. By combining Microsoft’s AI ecosystem with our internal data strategy, we’ve created a practical and impactful sales assistant that improves productivity, accelerates deal cycles, and enhances client engagement. We hope you found this blog useful, and if you would like to discuss anything, you can reach out to us at transform@cloudfonts.com
Share Story :
Using Open AI and Logic Apps to develop a Copilot agent for Elevator Pitches & Lead Qualification
In today’s competitive landscape, the ability to prepare quickly and deliver relevant, high-impact sales conversations is more critical than ever. Sales teams often spend valuable time gathering case studies, reviewing past opportunities, and preparing client-specific messaging — time that could be better spent engaging prospects. To address this, we developed “Smart Pitch” — a Microsoft Teams-integrated AI Copilot designed to equip our sales professionals with instant, contextual access to case studies, opportunity data, and procedural documentation. Challenge Sales professionals routinely face challenges such as: These hurdles not only slow down the sales cycle but also affect the consistency and quality of conversations with prospects. How It Works Platform Data Sources CloudFronts SmartPitch pulls information from the following knowledge sources: AI Integration Key Features MQL – SQL Summary Generator Users can request MQL – SQL document which contains The copilot prompts the user to provide the prospect name, contact person name, and client requirement. This is achieved via an adaptive card for better UX. HTTP Request to Logic App At Logic App we used ChatGPT API to fetch company and client information Extract the company location from the company information, and similarly, extract the industry as well. Render it to custom copilot via request to the Logic App. Use Generative answers node to display the results as required with proper formatting via prompt/Agent Instructions. Generative AI can also be instructed to directly create a formatted json based on parsed values. This formatted JSON can be passed to converted to an actual JSON and is used to populate a liquid template for the MQL-SQL file to dynamically create MQL-SQL for every searched company and contact person. This returns an HTML File with dynamically populated company and contact details as well as similar case studies, and work with client in similar region and industry. This triggers an auto download of the MQL-SQL created as a PDF file on your system. Content Search Users can ask questions related to – 1. Case Study FAQ: Helps users ask questions about client success stories and project case studies, retrieves relevant information from a knowledge source, and offers follow-up FAQs before ending the conversation. Cloudfronts official website is used for fetching Case Studies information. 2. Opportunities: Helps users inquire about past projects or opportunities, detailing client names, roles, estimated revenue and outcomes. 3. SOPs: Provides quick answers and summaries for frequently asked questions related to organizational processes and SOPs. Users can ask questions like “Smart Pitch” searches SharePoint documents, public case studies, and the opportunity table to return relevant results — structured and easy to consume. Security & Governance Integrated in Microsoft Teams, so the same authentication as Teams. Access to Dataverse and SharePoint is read-only and scoped to organizational permissions. To conclude, Smart Pitch reflects our commitment to leveraging AI to drive business outcomes. By combining Microsoft’s AI ecosystem with our internal data strategy, we’ve created a practical and impactful sales assistant that improves productivity, accelerates deal cycles, and enhances client engagement. We hope you found this blog useful, and if you would like to discuss anything, you can reach out to us at transform@cloudfonts.com