Automating Document Vectorization from SharePoint Using Azure Logic Apps and Azure AI Search - CloudFronts


In modern enterprises, documents stored across platforms like SharePoint often remain underutilized due to the lack of intelligent search capabilities. What if your organization could automatically extract meaning from those documents—turning them into searchable vectors for advanced retrieval systems? That’s exactly what we’ve achieved by integrating Azure Logic Apps with Azure AI Search.

Workflow Overview

Whenever a user uploads a file to a designated SharePoint folder, a scheduled Azure Logic App is triggered to:

  • Replicate the folder structure in an Azure Blob Storage container.
  • Copy all uploaded documents into their respective blob directories.
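As a rough illustration of the folder replication, the sketch below maps a SharePoint file path to a blob name. The function name `blob_name_for` and the sample paths are hypothetical, and it assumes a standard storage account where "directories" are simply slash-separated prefixes in the blob name.

```python
from pathlib import PurePosixPath


def blob_name_for(sharepoint_path: str, watched_folder: str) -> str:
    """Map a SharePoint server-relative file path to a blob name.

    Assumption: blob "directories" are just slash-separated prefixes,
    so keeping the path relative to the watched folder is enough to
    mirror the SharePoint folder structure.
    """
    path = PurePosixPath(sharepoint_path)
    # relative_to raises ValueError if the file is outside the watched folder.
    return path.relative_to(watched_folder).as_posix()


if __name__ == "__main__":
    print(blob_name_for(
        "/Shared Documents/Vectorize/Contracts/2024/msa.pdf",
        "/Shared Documents/Vectorize",
    ))
```

Keeping the relative path as the blob name means a sub-folder in SharePoint shows up as the same virtual folder in the container.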

Once the files are stored, a scheduled Azure AI Search (formerly Azure Cognitive Search) indexer kicks in. This indexer:

  • Processes all documents using a predefined skillset (e.g., extracting text, layout, and images).
  • Vectorizes the content using Azure OpenAI embedding models, turning raw files into searchable embeddings.
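To make the skillset idea more concrete, here is a minimal sketch of a skillset definition built as a Python dictionary, with a text split skill followed by an Azure OpenAI embedding skill. The helper name `build_skillset`, the chunk sizes, the model name, and the `text_vector` output name are our assumptions for illustration; the property names reflect the skillset REST schema as we understand it and should be checked against the API version you use.

```python
import json


def build_skillset(name: str, openai_uri: str, deployment: str) -> dict:
    """Return a skillset body: split text into chunks, then embed each chunk."""
    return {
        "name": name,
        "skills": [
            {
                # Chunking: break the extracted text into overlapping pages.
                "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
                "context": "/document",
                "textSplitMode": "pages",
                "maximumPageLength": 2000,   # assumed chunk size (characters)
                "pageOverlapLength": 200,    # assumed overlap (characters)
                "inputs": [{"name": "text", "source": "/document/content"}],
                "outputs": [{"name": "textItems", "targetName": "pages"}],
            },
            {
                # Vectorization: one embedding per chunk.
                "@odata.type": "#Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill",
                "context": "/document/pages/*",
                "resourceUri": openai_uri,
                "deploymentId": deployment,
                "modelName": "text-embedding-3-small",  # assumed model
                "inputs": [{"name": "text", "source": "/document/pages/*"}],
                "outputs": [{"name": "embedding", "targetName": "text_vector"}],
            },
        ],
    }


if __name__ == "__main__":
    body = build_skillset(
        "docs-skillset",                       # hypothetical name
        "https://example.openai.azure.com",    # hypothetical endpoint
        "embedding-deployment",                # hypothetical deployment
    )
    print(json.dumps(body, indent=2))
```

The import wizard described later generates a comparable skillset for you; writing it out by hand is mainly useful when you want to version or tweak the pipeline.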

Technologies / resources used:
-> SharePoint: A common document repository for enterprise users, ideal for collaborative uploads.

-> Azure Logic Apps: Provides low-code automation to monitor SharePoint for changes and sync files to Blob Storage. It ensures a reliable, scheduled trigger mechanism with minimal overhead.

-> Blob Storage: Serves as the staging ground where documents are centrally stored for indexing—cheaper and more scalable than relying solely on SharePoint connectors.

-> Azure AI Search (formerly Cognitive Search): The intelligence layer that runs a skillset pipeline to extract, transform, and vectorize the content, enabling semantic search, multimodal RAG (Retrieval Augmented Generation), and other AI-enhanced scenarios.

Why Not Vectorize Directly from SharePoint?

While Azure AI Search supports SharePoint as a data source, it currently lacks native support for advanced skillsets like custom AI skills, document chunking, and OpenAI-based embedding generation when indexing directly from SharePoint. Moreover, file-level access limitations, latency issues, and format inconsistencies in SharePoint-hosted files can hinder preprocessing.

By first syncing documents to Azure Blob Storage, we gain:

  • Full control over file structures and formats.
  • Support for all AI enrichment capabilities, including multi-skill pipelines, image extraction, and vector generation.
  • Improved reliability and performance during indexing.

In essence, Blob Storage acts as a clean staging layer that unlocks the full power of Azure AI Search.
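For readers new to document chunking, the following is a simplified sketch of what splitting text into overlapping chunks looks like. It is a plain character-based splitter written for illustration, not the algorithm Azure AI Search uses internally; `chunk_text` and its default sizes are our own assumptions.

```python
def chunk_text(text: str, size: int = 2000, overlap: int = 200) -> list[str]:
    """Split text into chunks of at most `size` characters.

    Consecutive chunks share `overlap` characters so that a sentence cut
    at a chunk boundary still appears whole in one of the two chunks.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        # Stop once the current chunk reaches the end of the text.
        if start + size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    print(chunk_text("abcdefghij", size=4, overlap=1))
```

Each chunk is then embedded separately, which is why chunk size and overlap directly affect retrieval quality.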

References:
1. https://learn.microsoft.com/en-us/azure/search/search-howto-index-sharepoint-online
2. https://learn.microsoft.com/en-us/azure/search/search-howto-indexing-azure-blob-storage


How to achieve this?

Stage 1: Logic App to sync SharePoint files to Blob Storage


First, create a designated SharePoint directory where the documents to be vectorized will be uploaded.

Then create the Logic App that replicates the files, along with their format and properties, to the associated Blob Storage container:

1] In the trigger “When an item is created or modified”, assign the site address and the name of the SharePoint directory where the documents are uploaded.

2] Assign a recurrence frequency, start time, and time zone so that the Logic App checks for new documents and keeps the blob container updated.

3] Add the action “Get file content using path” and dynamically provide the full path (including the file extension) from the trigger.

4] Finally, add an action to create blobs in the container that will be vectorized. Provide the storage account name, the directory path, the blob name (select the dynamic value for the file name with extension from the trigger), and the blob content (from the “Get file content using path” action).

5] After you save and run this Logic App, either manually or on trigger, the files are replicated in their exact form to Blob Storage.
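The Logic App itself is configured in the designer, but its logic can be sketched in a few lines of Python. In the sketch below, `ChangedFile`, `sync_to_blob`, and the two injected callables are hypothetical stand-ins for the trigger output, the “Get file content using path” action, and the create-blob action; no real connector API is being called.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Callable, Iterable


@dataclass(frozen=True)
class ChangedFile:
    """One file reported by the trigger (hypothetical shape)."""
    full_path: str      # e.g. "/Docs/In/HR/policy.pdf"
    name_with_ext: str  # e.g. "policy.pdf"


def sync_to_blob(
    changes: Iterable[ChangedFile],
    watched_folder: str,
    get_file_content: Callable[[str], bytes],
    create_blob: Callable[[str, str, bytes], None],
) -> int:
    """Copy each changed file to the container, keeping its sub-folder."""
    copied = 0
    for item in changes:
        content = get_file_content(item.full_path)            # step 3
        rel = PurePosixPath(item.full_path).relative_to(watched_folder)
        # Files directly in the watched folder go to the container root.
        folder = "" if rel.parent == PurePosixPath(".") else rel.parent.as_posix()
        create_blob(folder, item.name_with_ext, content)      # step 4
        copied += 1
    return copied


if __name__ == "__main__":
    # In-memory stand-ins for SharePoint and Blob Storage.
    sharepoint = {"/Docs/In/HR/policy.pdf": b"%PDF...", "/Docs/In/readme.txt": b"hi"}
    container: dict = {}
    changes = [ChangedFile(p, PurePosixPath(p).name) for p in sharepoint]
    count = sync_to_blob(
        changes,
        "/Docs/In",
        sharepoint.__getitem__,
        lambda folder, name, data: container.__setitem__((folder, name), data),
    )
    print(count, sorted(container))
```

Passing the two actions in as callables keeps the sketch testable without any Azure or SharePoint connection.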



Stage 2: Azure AI Search resource to vectorize the files in Blob Storage

In the Azure portal, search for the Azure AI Search service, provide the necessary details, and select a pricing tier based on your requirements.

Once the resource is successfully created, select “Import & vectorize data”.

From the two options, RAG and Multimodal RAG Index, select the latter.

RAG combines a retriever (to fetch relevant documents) with a generative language model (to generate answers) using text-only data.

Multimodal RAG extends the RAG architecture to include multiple data types such as text, images, tables, PDFs, diagrams, audio, or video.

Workflow:

  1. Query → searches a multimodal vector store (e.g., text and image embeddings)
  2. Retrieved content (e.g., text chunks + image descriptions) → fed to a multimodal model (like GPT-4o or Gemini)
  3. Model → combines reasoning across modalities to generate answers
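The three workflow steps above can be sketched with a toy in-memory store. Everything here is illustrative: the `Entry` class, the two-dimensional vectors, and the prompt wording are assumptions, and a real deployment would use the search index and an embedding model instead.

```python
import math
from dataclasses import dataclass


@dataclass
class Entry:
    kind: str            # "text" or "image"
    content: str         # a text chunk, or a description of an image
    vector: list[float]  # embedding (toy values in this sketch)


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vector: list[float], store: list[Entry], k: int = 2) -> list[Entry]:
    """Step 1: return the k entries most similar to the query."""
    return sorted(store, key=lambda e: cosine(query_vector, e.vector), reverse=True)[:k]


def build_prompt(question: str, hits: list[Entry]) -> str:
    """Step 2: assemble retrieved text and image descriptions for the model."""
    context = "\n".join(f"[{e.kind}] {e.content}" for e in hits)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


if __name__ == "__main__":
    store = [
        Entry("text", "The warranty period is 24 months.", [0.9, 0.1]),
        Entry("image", "Diagram of the warranty claim process.", [0.8, 0.3]),
        Entry("text", "Office opening hours.", [0.0, 1.0]),
    ]
    hits = retrieve([1.0, 0.0], store)
    # Step 3 would send this prompt to a multimodal model.
    print(build_prompt("How long is the warranty?", hits))
```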

Now follow the wizard steps and provide the details required to create the index.

Enable deletion tracking to remove the records of deleted documents from the index.
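For reference, deletion tracking corresponds to a deletion detection policy on the blob data source. The sketch below builds such a data source definition as a dictionary; `build_blob_data_source` and the sample values are hypothetical, and the native soft-delete policy shown assumes blob soft delete is enabled on the storage account.

```python
from typing import Optional


def build_blob_data_source(
    name: str,
    container: str,
    connection_string: str,
    folder: Optional[str] = None,
) -> dict:
    """Return a blob data source body with deletion detection enabled."""
    container_spec = {"name": container}
    if folder:
        # Optional: restrict indexing to one virtual folder in the container.
        container_spec["query"] = folder
    return {
        "name": name,
        "type": "azureblob",
        "credentials": {"connectionString": connection_string},
        "container": container_spec,
        # Assumption: relies on blob soft delete being turned on for the account.
        "dataDeletionDetectionPolicy": {
            "@odata.type": "#Microsoft.Azure.Search.NativeBlobSoftDeleteDeletionDetectionPolicy"
        },
    }


if __name__ == "__main__":
    print(build_blob_data_source("docs-ds", "documents", "<connection-string>", "Contracts"))
```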

Provide a Document Intelligence resource to enable OCR and to capture location metadata for multiple document types.

Select image verbalization (to generate text descriptions of images) or multimodal embedding (to vectorize the whole image).

Assign the embedding model that generates the vectors for the text and images.

Provide an image output location to store the images extracted from the files.


Assign a schedule to run the indexer so that the search index stays up to date with new documents.
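Behind the wizard, the schedule is stored on the indexer as an ISO 8601 duration. The following sketch shows how such a schedule could be built; the helper names and sample resource names are hypothetical, and the 5-minute to 1-day range is our understanding of the service limits, so verify it for your tier.

```python
from datetime import datetime, timedelta, timezone


def iso8601_interval(interval: timedelta) -> str:
    """Format a timedelta as an ISO 8601 duration such as "PT1H30M"."""
    total_seconds = interval.total_seconds()
    if total_seconds % 60:
        raise ValueError("interval must be a whole number of minutes")
    minutes = int(total_seconds // 60)
    # Assumption: the service accepts intervals between 5 minutes and 1 day.
    if not 5 <= minutes <= 24 * 60:
        raise ValueError("interval must be between 5 minutes and 1 day")
    if minutes == 24 * 60:
        return "P1D"
    hours, mins = divmod(minutes, 60)
    return "PT" + (f"{hours}H" if hours else "") + (f"{mins}M" if mins else "")


def build_indexer(name: str, data_source: str, index: str, skillset: str,
                  every: timedelta, start: datetime) -> dict:
    """Return an indexer body; `start` should be a timezone-aware datetime."""
    return {
        "name": name,
        "dataSourceName": data_source,
        "targetIndexName": index,
        "skillsetName": skillset,
        "schedule": {
            "interval": iso8601_interval(every),
            "startTime": start.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        },
    }


if __name__ == "__main__":
    print(build_indexer(
        "docs-indexer", "docs-ds", "docs-index", "docs-skillset",
        every=timedelta(hours=1),
        start=datetime(2025, 1, 1, tzinfo=timezone.utc),
    ))
```

Aligning this interval with the Logic App recurrence avoids long gaps between a file landing in Blob Storage and appearing in the index.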


Once the index is successfully created, search for keywords in the index’s Search explorer to verify the vectorization. The results are ranked by their relevance (score or distance) to the user’s search query.
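The same check can be made outside the portal by posting a query to the index. The sketch below only builds the request URL and body, so it runs without credentials. The vector field name `text_vector`, the API version, and the sample endpoint are assumptions, and the `"kind": "text"` query relies on the index having a vectorizer configured.

```python
from urllib.parse import quote


def search_url(endpoint: str, index_name: str, api_version: str = "2024-07-01") -> str:
    """Build the document search URL for an index (api_version is an assumption)."""
    base = endpoint.rstrip("/")
    return f"{base}/indexes/{quote(index_name, safe='')}/docs/search?api-version={api_version}"


def build_search_request(query: str, vector_field: str = "text_vector", k: int = 5) -> dict:
    """Build a hybrid request: keyword search plus a vector query on the same text."""
    return {
        "search": query,
        "vectorQueries": [
            # The service embeds `text` with the index's vectorizer at query time.
            {"kind": "text", "text": query, "fields": vector_field, "k": k}
        ],
        "top": k,
    }


if __name__ == "__main__":
    print(search_url("https://example.search.windows.net", "docs-index"))
    print(build_search_request("warranty period"))
```

Sending the body as JSON in a POST request with the service's API key in the `api-key` header should return the matching chunks with their scores.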


Let us test this index in a custom Copilot agent by importing it as an Azure AI Search knowledge source.


When the agent is asked for document-specific information, it searches the index for the most relevant content, and generative AI renders the result in a readable format.



We hope you found this blog useful, and if you would like to discuss anything, you can reach out to us at transform@cloudfronts.com.

