Sunday, April 28, 2024

Lessons Learned – pdf, docx & Copilot in M365

If you save a document once as a Word file (*.docx) and once as a PDF file (*.pdf), Copilot gives significantly different responses about the two files, even though their content is identical.

Technically, the *.docx and *.pdf formats differ significantly in structure and detail. That is why the information available to Copilot via the M365 Graph / the search index differs, even though it is the same document with the same content.

Details on pdf

Text in a pdf is defined by the following attributes: font, font size, character string, color, position on the page and the type of display, e.g. italics. Information such as line breaks, headings, paragraphs or indents is not included. This means that the paragraph formatting used in *.docx files is missing. The text itself is divided into fragments, which can be as small as individual characters or as large as an entire line. These fragments are not necessarily stored in reading order; they are like pieces of a puzzle that form an overall picture only when they are positioned correctly.
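To make the puzzle idea more concrete, here is a minimal sketch in Python of how a reader might put positioned fragments back into lines. The Fragment class, the rebuild_lines function and the sample coordinates are made up for illustration; they are not taken from a real pdf library or from how Copilot works.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """One positioned piece of text (hypothetical, for illustration only)."""
    text: str
    x: float  # horizontal position of the fragment's left edge
    y: float  # baseline position; y grows upwards, as in pdf page space


def rebuild_lines(fragments, line_tolerance=2.0):
    """Group fragments into lines by baseline, then order each line left to right."""
    lines = []  # list of (baseline_y, [fragments on that line])
    # Walk from the top of the page down (largest y first).
    for frag in sorted(fragments, key=lambda f: -f.y):
        if lines and abs(lines[-1][0] - frag.y) <= line_tolerance:
            lines[-1][1].append(frag)
        else:
            lines.append((frag.y, [frag]))
    return [
        "".join(f.text for f in sorted(frags, key=lambda f: f.x))
        for _, frags in lines
    ]


# Fragments in storage order, which is not the reading order.
stored = [
    Fragment("world", 130.0, 700.0),
    Fragment("Second line", 72.0, 686.0),
    Fragment("Hello ", 72.0, 700.0),
]
rebuilt = rebuild_lines(stored)
print(rebuilt)  # ['Hello world', 'Second line']
```

Note that even this sketch only recovers lines. Whether "Second line" starts a new paragraph, continues the first one or is a heading cannot be read from the positions alone.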

As a result, the actual text has no structure comparable to that of a *.docx file. A *.pdf file essentially has a header, a body and a trailer. The header identifies the file as a pdf and states the version of the pdf file format. The body contains the actual content of the file, e.g. text, images and other media, and usually also document information such as the creation date and the author. The trailer tells a reader where to start: it holds the number of entries in the xref table, references to the root object and the document information, and the byte offset of the xref table within the file.

Structure of a pdf file:

  • Header
  • Body
  • xref Table
  • Trailer
  • Incremental updates
Source and further details on the structure of a pdf file: https://www.save-emails-as-pdf.com/news/pdf-file-format-internal-document-structure-explained/
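
As a rough illustration of the parts listed above, the following Python sketch reads the header version, the trailer dictionary and the xref offset from a hand-written skeleton. The SAMPLE bytes and the describe_pdf function are assumptions made for this example. The skeleton is heavily shortened (it has no pages), and real files with incremental updates contain several trailers, which this sketch does not handle.

```python
import re

# Hand-written skeleton with the same parts as a pdf file:
# header, body (one object), xref table, trailer.
SAMPLE = (
    b"%PDF-1.7\n"
    b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    b"xref\n0 2\n"
    b"0000000000 65535 f \n"
    b"0000000009 00000 n \n"
    b"trailer\n<< /Size 2 /Root 1 0 R >>\n"
    b"startxref\n45\n"
    b"%%EOF\n"
)


def describe_pdf(data):
    """Return the header version, the trailer dictionary text and the xref offset."""
    header = re.match(rb"%PDF-(\d\.\d)", data)
    trailer = re.search(rb"trailer\s*(<<.*?>>)", data, re.DOTALL)
    startxref = re.search(rb"startxref\s+(\d+)\s+%%EOF", data)
    return {
        "version": header.group(1).decode("ascii") if header else None,
        "trailer": trailer.group(1).decode("ascii") if trailer else None,
        "xref_offset": int(startxref.group(1)) if startxref else None,
    }


info = describe_pdf(SAMPLE)
print(info)
```

Nothing in these parts says where a paragraph or a heading begins. They only help a reader locate objects inside the file.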

Details on docx

*.docx files are based on XML. A docx file is a ZIP package of individual XML files that store content, layout and structure, and it can contain text, images, tables, diagrams and other data. The format also supports extensive formatting and document structure elements such as headers, footers and page numbers.
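
The difference to pdf becomes visible when you look inside such a package. The following Python sketch builds a stripped-down package in memory and reads the paragraphs back, including their style. The sample XML, the read_paragraphs function and the sample text are assumptions for illustration. A real docx contains more parts, so Word would not open this one.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the main WordprocessingML elements.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

DOCUMENT_XML = f"""<?xml version="1.0" encoding="UTF-8"?>
<w:document xmlns:w="{W}">
  <w:body>
    <w:p>
      <w:pPr><w:pStyle w:val="Heading1"/></w:pPr>
      <w:r><w:t>Cloud security strategy</w:t></w:r>
    </w:p>
    <w:p>
      <w:r><w:t>Four levels are described.</w:t></w:r>
    </w:p>
  </w:body>
</w:document>"""

# Build a minimal package that only holds the main document part.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as package:
    package.writestr("word/document.xml", DOCUMENT_XML)


def read_paragraphs(docx_bytes):
    """Return (style, text) for every paragraph in word/document.xml."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    result = []
    for para in root.iter(f"{{{W}}}p"):
        style = para.find(f"{{{W}}}pPr/{{{W}}}pStyle")
        text = "".join(t.text or "" for t in para.iter(f"{{{W}}}t"))
        style_name = style.get(f"{{{W}}}val") if style is not None else None
        result.append((style_name, text))
    return result


paragraphs = read_paragraphs(buffer.getvalue())
print(paragraphs)
```

Here the paragraph boundaries and the heading style are stated explicitly in the file, so an indexer does not have to guess them from positions on a page.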

Structure of a docx file:
  • document.xml
  • [Content_Types].xml
  • docProps
  • _rels
  • word
  • media
  • theme
  • customXml
  • settings.xml
  • fontTable.xml
  • webSettings.xml
  • styles.xml
  • header.xml
  • footer.xml
  • footnotes.xml
  • endnotes.xml
  • comments.xml
But users do not know about these format differences, do not notice them and should not have to care. Software that is supposed to just work must compensate for them. As things stand today, Copilot in M365 still has room for improvement here.

An example

The following example is about the file “A quick guide to secure Office 365”. It describes details for securing Office 365 and monitoring it. The file is from 2018 and, in addition to the description of the features, it contains several tables with different levels of security settings.

The following prompt was used for this example:

Please find the document “A quick guide to secure Office 365_ENG.docx”. In the document there is a chapter “Cloud security strategy with Office 365”. Please summarize what is described in this chapter. Describe what there is about the “Default” level, the “Medium” level, the “High” level and the “Very High” level. All other chapters of the document are not relevant. Refer to the table with the levels and describe the different levels.

In both cases, the first answer is not quite what you would expect. With the *.docx file, however, it is much better than the answer based on the *.pdf file, where only one level is described:

Follow-up prompts:
  • Follow-up for the pdf file:
    • Copilot responds: To provide a detailed description of the “Medium,” “High,” and “Very High” levels, I would need access to the full content of the document. If you can provide the document or direct me to it, I would be able to summarize the different levels as requested.
    • Prompt: Here you can find the document: %Link%
The result is then much better, even though Copilot suggested in its first answer that it had already found and used the correct file. See screenshot above.
  • Follow-up for the docx file:
    • Prompt: Please describe in more detail what the levels mean. Create a list with the details for each level.
The result is slightly better than the answer to the initial prompt, but it still differs from the pdf example, even though the content is the same.

Conclusion

Both approaches deliver what the user wants in the end. However, the prompts required and the level of detail depend heavily on the file format, and so do the outputs. In the end, it is up to the user to decide which output is better. In any case, they differ, even though the initial prompt and the content of the two files are the same.

 


 



