Sunday, April 28, 2024

Lessons Learned – pdf, docx & Copilot in M365

If you save the same document once as a Word file (*.docx) and once as a PDF file (*.pdf), Copilot gives significantly different responses about the files, even though their content is identical.

Technically, the *.docx and *.pdf formats differ significantly in their structure and details. That is why the information available to Copilot via the Microsoft Graph and the search index differs, even though both files hold the same document with the same content.

Details on pdf

Text in a PDF is described by the following attributes: font, font size, character string, color, position on the page and the type of display, e.g. italics. Structural information such as line breaks, headings, paragraphs and indents is not included, so the paragraph formatting used in *.docx files is missing. The text itself is divided into fragments, which can be as small as individual characters or as large as an entire line. These fragments are not necessarily stored in reading order; they are like pieces of a puzzle that form an overall picture only when they are positioned correctly.
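The puzzle effect can be illustrated with a minimal Python sketch. The `Fragment` class, the coordinates and the sample strings are hypothetical and only model the idea; this is not how a real PDF parser or Copilot works internally. The sketch assumes that fragments on the same line share roughly the same vertical position.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    text: str  # the character string of the fragment
    x: float   # horizontal position on the page
    y: float   # vertical position (larger y is higher up on the page)


def reading_order(fragments, line_tolerance=2.0):
    """Group fragments into lines by y position, then sort each line left to right."""
    lines = []  # list of (y, [fragments on that line])
    for frag in sorted(fragments, key=lambda f: -f.y):
        if lines and abs(lines[-1][0] - frag.y) <= line_tolerance:
            lines[-1][1].append(frag)
        else:
            lines.append((frag.y, [frag]))
    return [
        "".join(f.text for f in sorted(frags, key=lambda f: f.x))
        for _, frags in lines
    ]


# Hypothetical fragments, stored in an order unrelated to the reading order.
fragments = [
    Fragment("security", 120, 700),
    Fragment("Level: ", 50, 680),
    Fragment("Cloud ", 50, 700),
    Fragment("High", 110, 680),
    Fragment(" strategy", 190, 700),
]

stored = "".join(f.text for f in fragments)
restored = reading_order(fragments)
print("As stored:  ", stored)
print("Repositioned:", restored)
```

Only after sorting by position do the fragments read as text again. A table with several columns makes this reconstruction considerably harder, which may be one reason why the tables in the example further down cause trouble.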

As a result, the actual text has no structure comparable to that of a *.docx file. A *.pdf file essentially has a header, a body and a trailer. The header identifies the file as a PDF and states the version of the PDF file format. The body contains the actual content of the file, e.g. text, images and other media, as well as metadata such as the creation date and the author. The trailer tells a reader where to find the cross-reference (xref) table and the root object of the document.

Structure of a pdf file:

  • Header
  • Body
  • xref Table
  • Trailer
  • Incremental updates
Source and further details on the structure of a pdf file: https://www.save-emails-as-pdf.com/news/pdf-file-format-internal-document-structure-explained/
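To make the parts listed above more tangible, here is a small Python sketch that assembles a PDF-like skeleton in memory and then locates its parts again. The function names `build_minimal_pdf` and `describe` are my own, and the skeleton contains no pages or text, so it is meant for illustration only and not for opening in a viewer.

```python
import re


def build_minimal_pdf():
    """Assemble a tiny PDF-like byte string: header, body, xref table, trailer."""
    header = b"%PDF-1.7\n"
    objects = [
        b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n",
        b"2 0 obj\n<< /Type /Pages /Kids [] /Count 0 >>\nendobj\n",
    ]
    body = b""
    offsets = []
    for obj in objects:
        offsets.append(len(header) + len(body))  # byte offset of this object
        body += obj

    # The xref table lists the byte offset of every object in the body.
    xref_offset = len(header) + len(body)
    xref = b"xref\n0 %d\n" % (len(objects) + 1)
    xref += b"0000000000 65535 f \n"
    for off in offsets:
        xref += b"%010d 00000 n \n" % off

    # The trailer names the root object and points back to the xref table.
    trailer = (b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)) + (
        b"startxref\n%d\n%%%%EOF\n" % xref_offset
    )
    return header + body + xref + trailer


def describe(pdf):
    """Locate header version, xref table and trailer in the byte string."""
    version = re.match(rb"%PDF-(\d\.\d)", pdf).group(1).decode()
    startxref = int(re.search(rb"startxref\s+(\d+)\s+%%EOF", pdf).group(1))
    return {
        "version": version,
        "xref_offset": startxref,
        "xref_found": pdf[startxref:startxref + 4] == b"xref",
        "has_trailer": b"trailer" in pdf,
    }


pdf = build_minimal_pdf()
info = describe(pdf)
print(info)
```

Note that nothing in this skeleton says "paragraph" or "table": a reader jumps from the trailer to the xref table and from there to numbered objects.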

Details on docx

*.docx files are based on XML. This means they save the layout in individual XML files and structures and can contain text, images, tables, diagrams and other data. They also support extensive formatting and document structure elements such as headers, footers and page numbers.
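A short Python sketch shows what this means in practice. A *.docx file is a ZIP package of XML parts, so the standard library is enough to read paragraphs and their styles. The package built here is a heavily reduced stand-in (a placeholder `[Content_Types].xml` and no relationships), so Word would not open it; the sample text and the helper names `build_package` and `paragraphs` are assumptions for illustration.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical document content: one heading and one body paragraph.
DOCUMENT_XML = f"""<?xml version="1.0" encoding="UTF-8"?>
<w:document xmlns:w="{W_NS}">
  <w:body>
    <w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr><w:r><w:t>Cloud security strategy</w:t></w:r></w:p>
    <w:p><w:r><w:t>The default level </w:t></w:r><w:r><w:t>is the baseline.</w:t></w:r></w:p>
  </w:body>
</w:document>"""


def build_package():
    """Create a reduced docx-like ZIP package in memory."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("[Content_Types].xml", "<Types/>")  # placeholder only
        zf.writestr("word/document.xml", DOCUMENT_XML)
    buf.seek(0)
    return buf


def paragraphs(package):
    """Return (style, text) for every paragraph in word/document.xml."""
    with zipfile.ZipFile(package) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    result = []
    for p in root.iter(f"{{{W_NS}}}p"):
        style = p.find(f"{{{W_NS}}}pPr/{{{W_NS}}}pStyle")
        style_name = style.get(f"{{{W_NS}}}val") if style is not None else None
        text = "".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t"))
        result.append((style_name, text))
    return result


found = paragraphs(build_package())
for style_name, text in found:
    print(style_name, "|", text)
```

Unlike in a PDF, the paragraph boundaries and the heading style are stored explicitly, so an indexer does not have to guess them from positions on a page.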

Structure of a docx file:
  • document.xml
  • [Content_Types].xml
  • docProps
  • _rels
  • word
  • media
  • theme
  • customXml
  • settings.xml
  • fontTable.xml
  • webSettings.xml
  • styles.xml
  • header.xml
  • footer.xml
  • footnotes.xml
  • endnotes.xml
  • comments.xml
Users, however, neither know nor care about these format differences, and they should not have to. Software that is meant to just work must compensate for them. As things stand today, Copilot in M365 has room for improvement here.

An example

The following example is about the file “A quick guide to secure Office 365”. It describes details for securing Office 365 and monitoring it. The file is from 2018 and, in addition to the description of the features, it contains several tables with different levels of security settings.

The following prompt was used for this example:

Please find the document “A quick guide to secure Office 365_ENG.docx”. In the document there is a chapter “Cloud security strategy with Office 365”. Please summarize what is described in this chapter. Describe what there is about the “Default” level, the “Medium” level, the “High” level and the “Very High” level. All other chapters of the document are not relevant. Refer to the table with the levels and describe the different levels.

In both cases, the first answer is not quite what you would expect. With the *.docx file, however, it is much better than the answer based on the *.pdf file, where only one level is described:

Follow-up prompts:
  • Prompt for the pdf file:
    • Copilot replies: To provide a detailed description of the “Medium,” “High,” and “Very High” levels, I would need access to the full content of the document. If you can provide the document or direct me to it, I would be able to summarize the different levels as requested.
    • Prompt: Here you can find the document: %Link%
The result is then much better, even though Copilot had already suggested in its first answer that it had found and used the correct file. See screenshot above.
  • Prompt for the docx file:
    • Prompt: Please describe in more detail what the levels mean. Create a list with the details for each level.
The result is slightly better than that of the initial prompt, but still differs from the result for the pdf file with the same content.

Conclusion

Both approaches deliver what the user wants in the end. However, the prompts and the details are very dependent on the file format. The outputs are also different depending on the file format. In the end, it is up to the user to decide which output is better or worse. In any case, they differ, even though the initial prompt and the content of the two files are the same.

Wednesday, April 24, 2024

Size matters - Large documents and Copilot for Microsoft 365

UPDATE

Problem solved - at least an improvement is on its way!
As described in my article “Size matters - Large documents and Copilot for Microsoft 365”, Copilot is currently reaching its limits with documents longer than 20 pages / 15,000 words.
Roadmap ID 399413 now announces that this limit is to increase significantly: “Copilot in Word will be able to fully summarize documents that it could previously only partially summarize. The upper limit increases to about four times more words.”
The Microsoft page linked in the article below, “Keep it short and sweet: a guide on the length of documents that you provide to Copilot”, has also been updated. It now mentions about 80,000 words.

--

Microsoft has published an article titled “Keep it short and sweet: a guide on the length of documents that you provide to Copilot”. It describes how Copilot for Microsoft 365 reaches its limits when it has to work with large documents or very long emails.

The reason for this is that Copilot works with data from the Microsoft Graph, which means that the search in M365 also plays a role here. Documents, emails and all other content must first be indexed by the search before they are available to Copilot. At least for the search in SharePoint Online, the limits are documented: https://learn.microsoft.com/en-us/sharepoint/search-limits.

The exact limits that apply to processing by Copilot in Microsoft 365 are currently unclear. The article “Keep it short and sweet: a guide on the length of documents that you provide to Copilot” gives the following recommendations:

  • Shorter than 20 pages
  • Maximum of around 15,000 words

The following example shows how Copilot behaves when the relevant information lies beyond these recommended limits. The relevant information to be used via Copilot is as follows. It is on page 49 of a Word document that contains a total of 27,208 words.


If you ask Copilot “What can you tell me about Snabales Total liabilities?” you get the following answer:
If you use Copilot in Word and ask the same question, the answer is: “This response isn't based on the document: I'm sorry, but the document does not provide any information about Snabales Total liabilities...”
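
The numbers from this example can be checked against the recommendations with a few lines of Python. This is only a rough sketch: the constants are taken from the figures mentioned in this article, the function names are my own, and counting whitespace-separated words is an assumption that will not necessarily match how Copilot or Word count.

```python
# Figures from this article; how Copilot counts internally is not documented here.
ORIGINAL_RECOMMENDED_WORDS = 15_000  # recommendation at the time of the test
UPDATED_RECOMMENDED_WORDS = 80_000   # figure mentioned in the update above


def count_words(text):
    """Count whitespace-separated words (a simplification)."""
    return len(text.split())


def within_recommendation(word_count, limit):
    """True if the document stays within the recommended length."""
    return word_count <= limit


document_words = 27_208  # the Word document used in the example

fits_original = within_recommendation(document_words, ORIGINAL_RECOMMENDED_WORDS)
fits_updated = within_recommendation(document_words, UPDATED_RECOMMENDED_WORDS)
print("Within original recommendation:", fits_original)
print("Within updated recommendation: ", fits_updated)
```

The test document exceeds the original recommendation but would stay within the updated figure.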


One option here is not to rely on Copilot for Microsoft 365 alone, but to build your own solution based on Azure AI Search and Azure OpenAI. In Azure AI Search, you can use vector search and split large documents into so-called chunks. This article describes the details: “Chunking large documents for vector search solutions in Azure AI Search”
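
The basic idea of chunking can be sketched in plain Python. This is not the Azure AI Search API, only a minimal illustration of fixed-size chunks with overlap; the function name `chunk_words` and the sizes of 500 and 100 words are assumptions chosen for the example.

```python
def chunk_words(text, chunk_size=500, overlap=100):
    """Split text into chunks of at most chunk_size words.

    Consecutive chunks share `overlap` words so that a sentence cut at a
    chunk boundary still appears complete in one of the two chunks.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last words are already covered
    return chunks


# Placeholder text with as many words as the document in the example above.
placeholder = " ".join(f"w{i}" for i in range(27_208))
chunks = chunk_words(placeholder)
print(len(chunks), "chunks, last chunk has", len(chunks[-1].split()), "words")
```

Each chunk is small enough to be embedded and searched on its own, so information on page 49 is just as reachable as information on page 1.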



Sharing link “People in your organization” & Copilot for Microsoft 365

Microsoft Copilot for Microsoft 365 only surfaces organizational data to which individual users have at least view permissions. Source: https://learn.microsoft.com/en-us/copilot/microsoft-365/microsoft-365-copilot-privacy#how-does-microsoft-copilot-for-microsoft-365-use-your-proprietary-organizational-data

The sharing function in SharePoint and OneDrive can be used to share content with users. When sharing, you can define which people or groups are granted access and with which rights:

The following applies:
  • Anyone gives access to anyone who receives this link, whether they receive it directly from you or forwarded from someone else. This may include people outside of your organization.
  • People in <Your Organization> with the link gives anyone in your organization who has the link access to the file, whether they receive it directly from you or forwarded from someone else.
  • People with existing access can be used by people who already have access to the document or folder. It doesn't change any permissions and it doesn't share the link. Use this if you just want to send a link to somebody who already has access. 
  • Specific people gives access only to the people you specify, although other people may already have access. This may include people outside of your organization. If people forward the sharing invitation, only people who already have access to the item will be able to use the link.

For the options “Anyone”, “People with existing access” and “Specific people”, everything described above also applies to Copilot: Microsoft Copilot for Microsoft 365 only surfaces organizational data for which individual users have at least view permissions.

The situation is slightly different with the option “People in <Your Organization> with the link”. The following applies here: 

Creating a People in your organization link will not make the associated file or folder appear in search results, be accessible via Copilot, or grant access to everyone within the organization. Simply creating this link does not provide organizational-wide access to the content. For individuals to access the file or folder, they must possess the link and it needs to be activated through redemption. A user can redeem the link by clicking on it, or in some instances, the link may be automatically redeemed when sent to someone via email, chat, or other communication methods. The link does not work for guests or other people outside your organization. 
Source and further details: https://learn.microsoft.com/en-us/sharepoint/deploy-file-collaboration#control-sharing

This can lead to confusing effects for users. For example, a user who has shared content in this way may assume that this information is now available to all users in the tenant and is therefore also accessible via Copilot. In the following example, the user Stan Laurel shares the files RefDoc.docx and SnabelesSnowball.docx via the “People in your organization” link.

Another user has received and clicked the link to the RefDoc.docx file, but not the link to the SnabelesSnowball.docx file.
This leads to the following result in Copilot although the user generally has access to both files:
Question to Copilot about the contents of the SnabelesSnowball.docx file for which the share link has not yet been clicked:
Question to Copilot about the contents of the RefDoc.docx file from which the sharing link has already been used at least once:
This effect, that a user is only granted access to a file once they have clicked on the sharing link, is also confirmed in another example. Here, Copilot is asked to create a list of all files that have been shared via the link type People in <your organization>. The file RefDoc.docx, whose sharing link has already been clicked, appears in the list. The file SnabelesSnowball.docx, for which this is not yet the case, is not mentioned.
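
The behavior observed in this example can be written down as a small model in Python. It is purely illustrative: the class `OrgLink`, the function `files_visible_in_copilot` and the user name "Other User" are hypothetical, and the model only reflects the redemption rule quoted above, not the actual SharePoint implementation.

```python
from dataclasses import dataclass, field


@dataclass
class OrgLink:
    """A 'People in your organization' link for a single file (simplified)."""
    file_name: str
    owner: str
    redeemed_by: set = field(default_factory=set)

    def redeem(self, user):
        # Clicking the link redeems it and grants this user access.
        self.redeemed_by.add(user)

    def has_access(self, user):
        return user == self.owner or user in self.redeemed_by


def files_visible_in_copilot(user, links):
    """Files Copilot may draw on for this user: only those they can access."""
    return [link.file_name for link in links if link.has_access(user)]


links = [
    OrgLink("RefDoc.docx", owner="Stan Laurel"),
    OrgLink("SnabelesSnowball.docx", owner="Stan Laurel"),
]

# The other user (name is hypothetical) has clicked only the RefDoc.docx link.
links[0].redeem("Other User")

visible_to_other = files_visible_in_copilot("Other User", links)
visible_to_owner = files_visible_in_copilot("Stan Laurel", links)
print("Other User: ", visible_to_other)
print("Stan Laurel:", visible_to_owner)
```

Creating the link changes nothing for the other user; only redeeming it does.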

There is also another side to this topic. If the default sharing link is “People in <Your Organization> with the link” and people spread this link further, for example by posting it in Teams or sending it by email, this can lead to all users suddenly receiving replies from Copilot based on the content behind the link, even if this information was not intended for everyone in the company. Caution and a solid concept for dealing with this topic are therefore necessary.
To see which kinds of sharing links are in use, the “Data access governance reports for SharePoint sites” can be used:

Source and further details: https://learn.microsoft.com/en-US/SharePoint/data-access-governance-reports?WT.mc_id=365AdminCSH_inproduct#sharing-links-reports