
Category Archives: Databricks

Inside SmartPitch: How CloudFronts Built an Enterprise-Grade AI Sales Agent Using Microsoft and Databricks Technologies 

1. Why SmartPitch? – The Idea and Pain Point

The idea for SmartPitch came directly from observing the day-to-day struggles of sales and pre-sales teams. Every Marketing Qualified Lead (MQL) to Sales Qualified Lead (SQL) conversion required hours of manual work: searching through documents stored in SharePoint, combing through case studies, aligning them with solution areas, and finally packaging them into a client-ready pitch deck. The reality was that documents across systems (SharePoint, Dynamics 365, PDFs, PPTs) remained underutilized because there was no intelligent way to bring them together. Sales teams often relied on tribal knowledge or reused existing decks with limited personalization.

We asked: what if a sales assistant could automatically pull the right case studies, map them to solution areas, and draft an elevator pitch on demand, in minutes? That became the SmartPitch vision: an AI-powered agent that does exactly that. The result: SmartPitch has helped us reduce pitch creation time by 70%.

2. The First Prototype – Custom Copilot Studio

Our first step was to build SmartPitch using Custom Copilot Studio. It gave us a low-code way to experiment with conversational flows, integrate with Azure AI Search, and provide sales teams with a chat interface. The build covered:

1. Knowledge Sources Integration
2. Data Flow
3. Conversational Flow Design
4. Integration and Security
5. Technical Stack
6. Business Process Enablement
7. Early Prototypes

With Custom Copilot Studio, we were able to stand up working prototypes quickly, and we successfully demoed them in Zurich and New York. They showed that the idea worked, but they also revealed serious limitations.

3. Challenges in Custom Copilot

Despite proving the concept, Custom Copilot Studio had critical shortcomings: it lacked support for model fine-tuning or advanced RAG customization, and incorporating complex external APIs or custom workflows was difficult. These limitations meant SmartPitch, in its Copilot form, couldn't scale to meet enterprise standards.

4. Rebuilding in Azure AI Foundry – Smarter, Extensible, Connected

The next phase was Azure AI Foundry, Microsoft's enterprise AI development platform. Unlike Copilot Studio, AI Foundry gave us far more control over models, orchestration, and integration.

Extending SmartPitch with Logic Apps

One of the biggest upgrades was the ability to integrate Azure Logic Apps as external tools for the agent, letting SmartPitch call out to other systems and workflows on demand. This modular approach meant we could add new functionality simply by publishing a new Logic App; no redeployment of SmartPitch was required.

Automating Document Vectorization

We also solved one of the biggest bottlenecks, document ingestion and retrieval, by building a pipeline for automatic document vectorization from SharePoint. This allowed SmartPitch to search across text, images, tables, and PDFs, providing relevant answers instead of keyword matches.

But There Were Limitations

Even with these improvements, we hit roadblocks. At this point, we realized the true bottleneck wasn't the agent itself; it was the quality of the data powering it.

5. Bad Data, Governance, and the Medallion Architecture

SmartPitch's performance was only as good as the data it retrieved, and much of the enterprise data was dirty: duplicate case studies, outdated documents, inconsistent file formats. This led to irrelevant or misleading responses in pitches.
To address this, we turned to Databricks' Unity Catalog and the Medallion Architecture. You can read our post on building a clean data foundation with Medallion Architecture [Link]. Now, every result SmartPitch surfaced could be trusted, audited, and tied to a governed source.

6. SmartPitch in Mosaic AI – The Final Evolution

The last stage was migrating SmartPitch into Databricks Mosaic AI, part of the Lakehouse AI platform. This was where SmartPitch matured into an enterprise-grade solution. What we gained in Mosaic AI: SmartPitch wasn't just a chatbot anymore; it became a data-native enterprise sales assistant. A short sketch of how the agent queries its governed, vectorized knowledge base follows this excerpt.

From this work, we identified the following differences between agent development in Azure AI Foundry and Databricks Mosaic AI:

| Attribute / Aspect | Azure AI Foundry | Mosaic AI |
|---|---|---|
| Focus | Developer and Data Scientist | Data Engineers, Analysts, and Data Scientists |
| Core Use Case | Create and manage your own AI agent | Build, experiment, and deploy data-driven AI models with analytics + AI workflows |
| Interface | Code-first (SDKs, REST APIs, Notebooks) | No-code/low-code UI + Notebooks + APIs |
| Data Access | Azure Blob, Data Lake, vector DBs | Native integration with Databricks Lakehouse, Delta Lake, Unity Catalog, vector DBs |
| MCP Server | Only custom MCP servers supported; built-in option complex | Native MCP support with Databricks ecosystem; simpler setup |
| Models | 90 models available | Access to open-source + foundation models (MPT, Llama, Mixtral, etc.) + partner models |
| Model Customization | Full model fine-tuning, prompt engineering, RAG | Fine-tuning, instruction tuning, RAG, model orchestration |
| Publish to Channels | Complex (Azure Bot SDK + Bot Framework + App Service) | Direct integration with Databricks workflows, APIs, dashboards, and third-party apps |
| Agent Update | Real-time updates in Microsoft Teams | Updates deployed via Databricks workflows; versioning and rollback supported |
| Key Capabilities | Prompt flow orchestration, RAG, model choice, vector search, CI/CD pipelines, Azure ML & responsible AI integration | Data + AI unification (native to Lakehouse), RAG with Lakehouse data, multi-model orchestration, fine-tuning, end-to-end ML pipelines, secure governance via Unity Catalog, real-time deployment |
| Key Components | Workspace & agent orchestration, 90+ models, OpenAI pay-as-you-go or self-hosted, security via Azure identity | Mosaic AI Agent Framework, Model Serving, Fine-Tuning, Vector Search, RAG Studio, Evaluation & Monitoring, Unity Catalog Integration |
| Cost / License | Vector DB: external; Model Serving: token-based pricing (GPT-3.5, GPT-4); fine-tuning: case-by-case; total agent cost variable (~$5k–$7k+/month) | Vector Search: $605–$760/month for 5M vectors; Model Serving: $90–$120 per million tokens; Fine-Tuning Llama 3.3: $146–$7,150; managed compute built into DBU usage; end-to-end AI agent ~$5k–$7k+/month |
| Use Cases / Capabilities | Agents are intelligent and can interact with/modify responses; single AI search per agent; infrastructure setup required; custom MCP server registration | Agents are intelligent and can interact with/modify responses; AI search via APIs (Google/Bing); in-built MCP server; complex infrastructure; slower responses as results are batch-sent together |
| Development Approach | Low-code, faster agent creation, SDK-based, easier experimentation | Manual coding using the MLflow library, more customization, API integration, higher chance of errors, slower build |
| Models Comparison | 90 models, Azure OpenAI (GPT-3.5, GPT-4), multi-modal | ~10 base models, OSS & partner models (Llama, Claude, Gemma); many models don't support tool usage |
| Knowledge Source | One knowledge source of each type (adding a new one replaces the previous) | No limitation; supports data cleaning via Medallion Architecture; SQL-only access inside the agent; Spark/PySQL not supported in the agent |
| Memory / Context Window | 8K–128K tokens (up to 1M for GPT-4.1) | Moderate, not specified |
| Modalities | Text, code, vision, audio (some models) | Likely text-only |
| Special Enhancements | Turbo efficiency, reasoning, tool calling, multimodal | Varies per model (Llama, Claude, Gemma architectures) |
| Availability | Deployed via Azure AI Foundry | Through the Databricks platform |
| Limitations | Only one knowledge source of each type; infrastructure complexity for MCP server | No multi-modal support or Spark/PySQL access; slower batch responses; limited model count; high manual development effort |

7. Lessons Learned: … Continue reading Inside SmartPitch: How CloudFronts Built an Enterprise-Grade AI Sales Agent Using Microsoft and Databricks Technologies
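To make the Mosaic AI retrieval step concrete, here is a minimal, hedged sketch of how an agent tool might query a governed Vector Search index built over a gold case-study table. It assumes the databricks-vectorsearch Python package and an existing Delta Sync index; the endpoint name, index name, and column names are illustrative assumptions, not the actual SmartPitch resources.

```python
# Sketch only: querying a Databricks Vector Search index from agent/tool code.
# Endpoint, index, and column names below are hypothetical.
from databricks.vector_search.client import VectorSearchClient

client = VectorSearchClient()  # picks up workspace auth inside a Databricks notebook

index = client.get_index(
    endpoint_name="smartpitch-vs-endpoint",              # hypothetical endpoint
    index_name="knowledge_base.gold.case_studies_index", # hypothetical index
)

# Retrieve the case studies most relevant to a sales query.
results = index.similarity_search(
    query_text="manufacturing customer success stories for Dynamics 365",
    columns=["title", "solution_area", "summary", "url"],
    num_results=5,
)

# The response is a dict; matching rows are under result.data_array.
for row in results.get("result", {}).get("data_array", []):
    print(row)
```

A retrieval call along these lines is what lets the agent ground each pitch in governed, Unity Catalog-managed sources rather than ad hoc document searches.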


Before You Add AI, Fix Your Foundations: How to Prepare Your Data for Intelligent Tools

Everyone wants AI. Few are ready for it. The question isn't "When do we start?" but "Are we prepared to get it right?" Because switching on Copilots without fixing your foundations doesn't accelerate you; it amplifies chaos. This article covers how to fix your foundations for AI so that the AI tools you deploy are accurate and reliable.

Challenges of Deploying AI Directly

Deploying AI directly on top of your business applications comes with some common challenges, and these issues can render the AI implementation a failure, immediately dismissing trust in using AI at all. These challenges can be overcome once the foundations of AI are in place, which we'll discuss in the next section.

Foundation of AI

At CloudFronts, we call this the 3 Pillars of AI Readiness, and it is how I sum up the foundation of the systems for AI: modernized systems, connected data, and governed access. For example, when CloudFronts helped Tinius Olsen modernize their systems, the focus wasn't just technical uplift. It was about ensuring every business process was cloud-ready so AI models could actually trust the data. Upgrading from legacy systems is part of this foundation, which needs to be in place before AI can be implemented at your organization.

Data & AI Maturity Curve by Databricks

With the above foundations in place for your AI adoption strategy, and the right framework chosen for your implementation, the Data & AI Maturity Curve by Databricks can be used as a reference to see where your organization sits on the curve and where you want to get to. At a high level, the foundation lets you look back at your data and see what has happened in the past, and AI tools can help you get this information accurately. Once trust is established, you can go further: have AI predict the future state of operations, prescribe next steps, and even take decisions on your behalf, provided you really want that to happen. It might be too soon just yet.

To conclude, AI success = Foundations × Trust. Without modern systems, connected data, and governed access, AI is just noise. But with these in place, every AI tool you deploy, whether predictive analytics or Copilots, becomes an accelerator for decision-making, not a distraction. Before you deploy AI, fix your foundations. If you're serious about making AI a trusted accelerator rather than a costly experiment, start with modernization, connection, and governance. At CloudFronts, we help enterprises build these foundations with confidence. Let's connect over email: Transform@cloudfronts.com


From Raw to Reliable: Cleaning Data at Scale with Azure Databricks

Are you struggling with messy spreadsheets full of duplicates, missing values, and inconsistent records? You're not alone. Data professionals spend nearly 80% of their time cleaning and preparing data before any real analysis begins. The truth is simple: without clean data, business reports are unreliable, AI models fail, and decision-making slows down. In this blog, we'll show you how Azure Databricks makes data cleaning easier, faster, and scalable, turning raw inputs into reliable insights with just a few lines of code.

Why Clean Data Matters

For business leaders, whether you're a Team Lead, CTO, or CEO, clean data directly impacts growth. With Azure Databricks, you get a cloud-native, Spark-powered platform that handles big data at scale while integrating seamlessly with Azure Data Lake, Synapse, and Power BI.

Practical Example: Cleaning a Sales Dataset in Azure Databricks

Imagine you have a raw CSV file in Azure Data Lake with customer sales data that suffers from duplicates, missing values, and inconsistent records. The solution is a few lines of PySpark in Databricks (a minimal sketch appears at the end of this excerpt). The output after cleaning looks like this:

| CustomerID | Name | Country | Sales |
|---|---|---|---|
| 101 | Alice | USA | 500 |
| 102 | Bob | USA | 300 |
| 103 | Unknown | UK | 450 |
| 104 | David | India | 0 |

With just a few lines of Spark code, the dataset is now ready for reporting, visualization, or machine learning.

To conclude, clean data is the foundation of every reliable business insight. With Azure Databricks, you can automate messy, manual processes and create repeatable, scalable pipelines that keep your data reliable, no matter how fast your business grows.

✅ Start small: try building a simple cleaning pipeline in Azure Databricks today.
✅ Save time: focus more on insights, less on manual data prep.
✅ Scale with confidence: as your data grows, Databricks grows with you.

👉 Want to take the next step? Explore how Databricks integrates with Power BI for real-time dashboards or with MLflow for machine learning pipelines. Stay tuned for our next post, where we'll cover these use cases in detail.

✨ With Databricks, your journey from raw to reliable data starts today. Contact us today at Transform@cloudfronts.com to get started.

To learn more about the functionalities of Databricks and other Azure AI services, please refer to my other blogs linked below:
1] The Hidden Cost of Bad Data: How Strong Data Management Unlocks Scalable, Accurate AI – CloudFronts
2] Automating Document Vectorization from SharePoint Using Azure Logic Apps and Azure AI Search – CloudFronts
3] Using Open AI and Logic Apps to develop a Copilot agent for Elevator Pitches & Lead Qualification – CloudFronts
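Since the PySpark snippet referenced above did not survive into this excerpt, here is a minimal sketch of the kind of cleaning it describes, assuming the raw CSV lives in ADLS and has the columns shown in the output table; the storage path and normalization rules are illustrative assumptions.

```python
# Minimal PySpark sketch of the cleaning step (runs in a Databricks notebook,
# where `spark` is predefined). Path and rules below are hypothetical.
from pyspark.sql import functions as F

raw_path = "abfss://raw@yourdatalake.dfs.core.windows.net/sales/customer_sales.csv"  # hypothetical

df = spark.read.option("header", True).option("inferSchema", True).csv(raw_path)

cleaned = (
    df.dropDuplicates(["CustomerID"])                                         # drop duplicate customers
      .withColumn("Name", F.coalesce(F.col("Name"), F.lit("Unknown")))        # fill missing names
      .withColumn("Sales", F.coalesce(F.col("Sales").cast("int"), F.lit(0)))  # missing sales -> 0
      .withColumn("Country", F.trim(F.col("Country")))                        # strip stray whitespace
      .replace({"United States": "USA", "U.S.": "USA"}, subset=["Country"])   # normalize country labels
)

cleaned.show()
```

From here the cleaned DataFrame can be written to a Delta table and picked up by Power BI or an ML pipeline, which is the pattern the rest of this series builds on.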


Setting Up Unity Catalog in Databricks for Centralized Data Governance

The fastest way to lose control of enterprise data? Managing governance separately across workspaces. Unity Catalog solves this with one centralized layer for security, lineage, and discovery. Data governance is crucial for any organization looking to manage and secure its data assets effectively. Databricks' Unity Catalog is a centralized solution that provides a unified interface for managing access control, auditing, data lineage, and discovery. This blog will guide you through the process of setting up Unity Catalog in your Databricks workspace.

What is Unity Catalog?

Unity Catalog is Databricks' answer to centralized data governance. It enables organizations to enforce standards-compliant security policies, apply fine-grained access controls, and visualize data lineage across multiple workspaces. It ensures compliance and promotes efficient data management.

Key Features:
1] Standards-Compliant Security: ANSI SQL-based access policies that apply across all workspaces in a region.
2] Fine-Grained Access Control: Support for row- and column-level permissions.
3] Audit Logging: Tracks who accessed what data and when.
4] Data Lineage: Provides visualization of data flow and dependencies.

Unity Catalog Object Hierarchy

Before diving into the setup, it's important to understand the hierarchical structure of Unity Catalog:
1] Catalogs: The top-level container (e.g., Production, Development) that represents an organizational unit or environment.
2] Schemas: Logical groupings of tables, views, and AI models within a catalog.
3] Tables and Views: These include managed tables fully governed by Unity Catalog and external tables referencing existing cloud storage.

Here is the procedure to set up a Unity Catalog metastore backed by Azure Storage, as I have done for one of our products (SmartPitch Sales & Marketing Agent):

1] First, create a storage account with the primary service set to "Azure Blob Storage or Azure Data Lake Storage Gen 2". Performance and redundancy can be chosen based on the requirement for which the Databricks service is being used. For my Mosaic AI agent, I used Locally Redundant Storage and Data Lake Gen 2.
2] Once the storage account is created, ensure that you have enabled "Hierarchical Namespace". When creating a Unity Catalog metastore with Azure Blob Storage, Hierarchical Namespace (HNS) is required because Unity Catalog needs:
a] A folder-like structure to organize catalogs, schemas, and tables.
b] Atomic operations (rename, move, delete) on directories and files.
c] POSIX-style access controls for fine-grained permissions.
d] Faster metadata handling for lineage and governance.
HNS turns Azure Blob into ADLS Gen2, which supports these features.
3] Upload any raw/unclean files you will need in Databricks to your metastore folder in the blob storage.
4] Create a Unity Catalog connector in the Azure portal and assign it the "Storage Blob Data Contributor" role.
5] Assign CORS (Cross-Origin Resource Sharing) settings for that storage account. Why this is necessary: without configuring CORS, Databricks cannot communicate with your storage container to read/write managed tables, schema metadata, or logs.
6] Generate a SAS token.
7] Navigate to your workspace and select "Manage Account"; this should be done by the account admin.
8] Select the Catalog tab on the left and then click "Create Metastore".
9] Assign a name, a region (same as the workspace), the path to the storage account, and the connector ID.
10] Once the metastore is created, assign it to a workspace.
11] Once this is done, the catalogs, schemas, and tables within it can be created. (A minimal sketch of creating these objects from a notebook follows this excerpt.)

How does Unity Catalog differ from Hive Metastore?

| Feature | Hive Metastore | Unity Catalog |
|---|---|---|
| Scope | Workspace or cluster-specific | Centralized, spans multiple workspaces and regions |
| Architecture | Single metastore tied to Spark/Hive | Cloud-native service integrated with Databricks |
| Object Hierarchy | Databases → Tables → Partitions | Catalogs → Schemas → Tables/Views/Models |
| Data Assets Supported | Tables, views | Tables, views, files, ML models, dashboards |
| Security | Basic GRANT/DENY at database/table level | Fine-grained, ANSI SQL-based (catalog, schema, table, column, row) |
| Lineage | Not available | Built-in lineage and impact analysis |
| Auditing | Limited or external | Integrated audit logs across workspaces |
| Storage Management | Points to storage locations; no governance | Manages external and managed tables with governance |
| Cloud Integration | Primarily cluster storage or external paths | Secure integration with ADLS Gen2, S3, GCS |
| Permissions Model | Spark SQL statements | Attribute- and role-based access, unified policies |
| Use Cases | Basic metadata store for Spark/Hive workloads | Enterprise-wide data governance, sharing, and compliance |

To conclude, Unity Catalog is the next-generation governance and metadata solution for Databricks, designed to give organizations a single, secure, and scalable way to manage data and AI assets. Unlike the older Hive Metastore, it centralizes control across multiple workspaces, supports fine-grained access policies, delivers built-in lineage and auditing, and integrates seamlessly with cloud storage like Azure Data Lake, S3, or GCS. When setting it up, the key steps include:
1] Creating a metastore and linking it to your workspaces.
2] Enabling hierarchical namespace on Azure storage for folder-level security and operations.
3] Configuring CORS to allow Databricks domains to interact with storage.
4] Defining catalogs, schemas, and tables for structured governance.

By implementing Unity Catalog, you ensure stronger security, better compliance, and faster data discovery, making your Databricks environment enterprise-ready for analytics and AI.

Business Outcomes of Unity Catalog

By implementing Unity Catalog, organizations can achieve stronger security, better compliance, and faster data discovery across their data estate. Why now? As data volumes and regulatory requirements grow, organizations can no longer rely on fragmented or legacy governance tools. Unity Catalog offers a future-proof foundation for unified data management and AI governance, essential for any modern data-driven enterprise.

At CloudFronts, we help enterprises implement and optimize Unity Catalog within Databricks to ensure secure, compliant, and scalable governance of enterprise data. Book a consultation with our experts to explore how Unity Catalog can simplify compliance and boost productivity for your teams. Contact us today at Transform@cloudfronts.com to get started.

To learn more about the functionalities of Databricks and other Azure AI services, please refer to my other blogs linked below:
1] The Hidden Cost of Bad Data: How Strong Data Management Unlocks Scalable, Accurate AI – CloudFronts
2] Automating Document Vectorization from SharePoint Using Azure Logic Apps and Azure AI Search – CloudFronts
3] Using Open AI and Logic Apps to develop a Copilot agent for … Continue reading Setting Up Unity Catalog in Databricks for Centralized Data Governance
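As a follow-up to step 11], here is a minimal sketch of creating the governance objects from a Databricks notebook once the metastore is assigned to the workspace; the catalog, schema, table, and group names are illustrative assumptions rather than a prescribed layout.

```python
# Sketch only: creating Unity Catalog objects and a simple grant from a notebook
# (the `spark` session is predefined in Databricks). Names are hypothetical.
spark.sql("CREATE CATALOG IF NOT EXISTS knowledge_base")
spark.sql("CREATE SCHEMA IF NOT EXISTS knowledge_base.raw")

spark.sql("""
    CREATE TABLE IF NOT EXISTS knowledge_base.raw.case_studies (
        id            STRING,
        title         STRING,
        content       STRING,
        modified_date TIMESTAMP
    )
""")

# Example of fine-grained access: read-only rights for a hypothetical `analysts` group.
spark.sql("GRANT USE CATALOG ON CATALOG knowledge_base TO `analysts`")
spark.sql("GRANT USE SCHEMA, SELECT ON SCHEMA knowledge_base.raw TO `analysts`")
```

Because these objects live in the metastore rather than in a single workspace, the same grants apply wherever the metastore is attached, which is the centralization benefit described above.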


The Hidden Cost of Bad Data: How Strong Data Management Unlocks Scalable, Accurate AI

Key Takeaways
1. Bad data kills ML performance: duplicate, inconsistent, missing, or outdated data can break production models even if training accuracy is high.
2. The Databricks Medallion Architecture (Raw → Silver → Gold) is essential for turning raw chaos into structured, trustworthy datasets ready for ML and BI.
3. Raw layer: capture all data as-is; Silver layer: clean, normalize, and standardize; Gold layer: curate, enrich, and prepare for modeling or reporting.
4. Structured pipelines reduce compute cost, improve model reliability, and enable proactive monitoring and feature freshness.
5. Data quality is as important as algorithms: invest in transformation and governance for scalable, accurate AI.

While developing a Machine Learning Agent for SmartPitch (a sales/presales/marketing assistant chatbot) within the Databricks ecosystem, I've seen firsthand how unclean, inconsistent, and incomplete data can cripple even the most promising machine learning initiatives. While much attention is given to models and algorithms, what often goes unnoticed is the silent productivity killer lurking in your pipelines: bad data. In the chase to build intelligent applications, developers often overlook the foundational layer: data quality. Here's the truth: it is difficult to scale machine learning on top of chaos. That's where the Raw → Silver → Gold data transformation framework in Databricks becomes not just useful, but essential.

The Hidden Cost of Bad Data

Imagine you've built a high-performance ML model. It's accurate during training, but underperforms in production. Why? Here's what I frequently detect when I scan input pipelines:
– Duplicate records
– Inconsistent data types
– Missing values
– Outdated information
– Schema drift (schema drift introduces malformed, inconsistent, or incomplete data that breaks validation rules and compromises downstream processes)
– Noise in the data

These issues inflate compute costs, introduce bias, and produce unstable predictions, resulting in wasted hours debugging pipelines, increased operational risks, and eroded trust in your AI outputs.

Ref.: Data Poisoning: A Silent but Deadly Threat to AI and ML Systems | by Anya Kondamani | nFactor Technologies | Medium

As you can see in this example, despite successfully fetching the data, the agent was unable to render it because it hit the maximum request limit; bad data was involved. AI agents run into such issues when raw, unclean data is forced on them directly. But this isn't just a data science problem. It's a data engineering problem. And the solution lies in structured, governed data management, beginning with a robust medallion architecture.

Ref.: Data Intelligence End-to-End with Azure Databricks and Microsoft Fabric | Microsoft Community Hub

Raw → Silver → Gold: The Databricks Way

Databricks ML agents thrive when your data is managed through the medallion architecture, transforming raw chaos into clean, trustworthy features. For those new to it, the medallion architecture in Databricks is a structured data processing framework that progressively improves data quality through bronze (raw), silver (validated), and gold (enriched) layers, enabling scalable and reliable analytics.

Raw Layer: Ingest Everything

The raw layer is where we land all data, regardless of quality. It's your unfiltered feed: logs, events, customer input, third-party APIs, CSV dumps, IoT signals, etc. For example:
CloudFronts case studies, fetched directly from the WordPress API. The accompanying script automates the extraction, transformation, and optional upload of WordPress-based case study content into Azure Blob Storage, as well as locally to the DBFS of Databricks, making it ready for further processing (e.g., vector databases, AI ingestion, or analytics).

What We've Done

On running the Raw-to-Silver cleaning code in a Databricks notebook, we get nicely formatted, relevant, and cleaned fields specific to our requirement. In Databricks Unity Catalog → Schema → Tables, it looks something like this:

Silver Layer: Structure and Clean

Once ingested, we promote data to the silver layer, where transformation begins. Think of this layer as the data refinery. Here we start to analyze trends, infer schemas, and prepare for more active feature generation. Once I have removed noise, I begin to see patterns.

The next script is part of a data pipeline that transforms unstructured or semi-clean data from a raw Delta table (knowledge_base.raw.case_studies; effectively silver, as we already did much of the cleaning in the previous step) into a gold Delta table (knowledge_base.gold.case_studies) using Apache Spark. The transformation focuses on HTML cleaning, type parsing, and schema standardization, enabling downstream ML or BI workflows. A minimal sketch of this Silver-to-Gold step appears at the end of this excerpt. What have we done here?
– Start Spark Session: initializes a Spark job using SparkSession, enabling distributed data processing within the Databricks environment.
– Define UDFs (User-Defined Functions).
– Read Data from Silver Table: loads structured data from the Delta table knowledge_base.raw.case_studies, which acts as the silver layer in the medallion architecture.
– Clean & Normalize Key Columns.
– Write to Gold Table: saves the final transformed DataFrame to the gold layer as a Delta table, knowledge_base.gold.case_studies, using overwrite mode and allowing schema updates.

As you can see, we have maintained separate schemas for Raw, Silver, Gold, etc., and at gold we finally have fully cleaned, noiseless data suitable for our requirement.

Gold Layer: Ready for ML & BI

At the gold layer, data becomes a polished product, curated for specific use cases: ML models, dashboards, reports, or APIs. This is the layer where scale and accuracy become feasible, because the model finally has clean, enriched, semantically meaningful data to learn from.

Ref.: What is a Medallion Architecture?

We can make retrieval even more efficient with vectorization. Once we reach this stage and test again in the Agent Playground, we no longer face the error we saw previously, because it is now much easier for the agent to retrieve gold-standard vectorized data.

Why This Matters for AI at Scale

The difference between "good enough" and "state of the art" often hinges on data readiness. Here's how strong data management impacts real-world outcomes:

| Without Medallion Architecture | With Raw → Silver → Gold |
|---|---|
| Data drift goes undetected | Proactive schema monitoring |
| Models degrade silently | Continuous feature freshness |
| Expensive debugging cycles | Clean lineage via Delta Lake |
| Inconsistent outputs | Predictable, testable results |

(Delta Lake is an open-source storage layer that brings ACID transactions, scalable metadata handling, and unified batch processing to data lakes, enabling reliable analytics on massive datasets.) … Continue reading The Hidden Cost of Bad Data: How Strong Data Management Unlocks Scalable, Accurate AI
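Since the transformation notebook itself is not reproduced in this excerpt, the following is a minimal sketch of the Silver-to-Gold step described above. It reuses the table names mentioned in the post (knowledge_base.raw.case_studies and knowledge_base.gold.case_studies), but the column names and cleaning rules are illustrative assumptions.

```python
# Sketch of the Silver-to-Gold transformation: strip residual HTML, parse types,
# standardize the schema, and overwrite the gold Delta table.
import re
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

@F.udf(StringType())
def clean_html(text):
    """Remove HTML tags and collapse whitespace in a text field."""
    if text is None:
        return None
    text = re.sub(r"<[^>]+>", " ", text)      # drop tags
    return re.sub(r"\s+", " ", text).strip()  # normalize whitespace

silver_df = spark.table("knowledge_base.raw.case_studies")

gold_df = (
    silver_df
    .withColumn("content", clean_html(F.col("content")))          # hypothetical column
    .withColumn("title", F.trim(F.col("title")))                   # hypothetical column
    .withColumn("modified_date", F.to_timestamp("modified_date"))  # type parsing
    .dropDuplicates(["id"])                                        # hypothetical key
)

(
    gold_df.write
    .format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")   # allow schema updates, as described above
    .saveAsTable("knowledge_base.gold.case_studies")
)
```

With the gold table in place, a vector index can be built on top of it so the agent retrieves clean, governed content rather than raw HTML.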
