Anonymizing Healthcare Data: Building a DeID Tool for C-CDA XML Documents


What is C-CDA?
Consolidated Clinical Document Architecture (C-CDA) is a standard format for exchanging health information in the United States. C-CDA is an XML-based markup standard that allows for the exchange of patient records. C-CDA documents can contain a variety of information, including structured information like medication lists, and unstructured information like clinical notes.
How is C-CDA used?
All certified Electronic Health Records (EHRs) in the United States are required to export medical data using C-CDA. C-CDA documents are exchanged billions of times annually in the US.
What can C-CDA documents include?
C-CDA documents can encompass a broad spectrum of patient health information, ranging from details of a single clinical encounter to a comprehensive medical history. They may include procedure notes, diagnostic imaging reports, discharge summaries, demographic data, and key health metrics such as height, weight, blood pressure, and BMI.
What is De-identification?
C-CDA documents contain both clinical and demographic data. The demographic data, which includes personally identifiable information (PII), can be used to directly identify individual patients. To protect patient privacy and comply with regulatory requirements, these PII elements must be removed or replaced, a process known as de-identification.
Why De-Identify?
When C-CDA documents are shared for non-clinical uses like research or analytics, protecting patient privacy is essential. These documents often contain personally identifiable information (PII) such as demographics, SSNs, email addresses, and phone numbers, which can be linked back to individuals. This de-identification tool lets users choose which PII elements to remove for their particular use case, so that only the necessary data is redacted and the output still conforms to the C-CDA specification.
This safeguards sensitive patient information while still allowing healthcare data to be shared for non-clinical purposes.
How to De-Identify?
This DeID tool helps in the de-identification process. It is a Java-based application with a web UI that allows users to de-identify C-CDA XML documents by redacting PII.
Frontend: HTML, CSS, Bootstrap, JavaScript, jQuery, Thymeleaf
The web user interface lets users either upload a C-CDA XML file or paste C-CDA XML text.
Users can select which types of PII they want to remove from the C-CDA.
The tool processes the XML data and provides a de-identified version of the C-CDA.
Backend: Java, Spring MVC, Spring Data JPA, Tomcat
The Text Controller and Text Service components handle the processing of XML text input.
The File Controller and File Service components handle the processing of uploaded XML files.
The XML Utility component removes PII from XML data, irrespective of whether the XML is uploaded as a file or provided as plain text by the user.
The de-identified C-CDA XML files are stored temporarily on the Tomcat server for users to download. These files are removed when the Tomcat server restarts.
Note: C-CDA XML files uploaded by users that contain PII are not stored anywhere.
Database: MySQL
The MySQL DB stores the configuration and mapping information for the de-identification process. It contains the XPaths of where the PII exists, and the data to replace it with.
When the Tomcat service starts, the following steps take place:
The configuration/mapping information is loaded into MySQL via SQL scripts (if not already present in MySQL DB).
The configuration/mapping information from MySQL is loaded into Tomcat memory. This data is then used whenever the server receives a C-CDA XML document to be de-identified.
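The startup flow above can be illustrated with a minimal sketch of an in-memory rule lookup. Everything here is an assumption made for illustration: the class name `DeidRuleCacheSketch`, the `Rule` fields, the category names, and the sample XPaths are hypothetical and are not taken from the tool's repository. In the real tool the rows would come from MySQL; here they are hard-coded so the example runs on its own.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical in-memory cache of de-identification rules, grouped by PII category. */
final class DeidRuleCacheSketch {

    /** One row of an assumed mapping table: where the PII lives and what replaces it. */
    static final class Rule {
        final String category;
        final String xpath;
        final String replacement;

        Rule(String category, String xpath, String replacement) {
            this.category = category;
            this.xpath = xpath;
            this.replacement = replacement;
        }
    }

    private final Map<String, List<Rule>> rulesByCategory = new LinkedHashMap<>();

    /** Built once at startup from the rows read out of the database. */
    DeidRuleCacheSketch(List<Rule> rowsFromDatabase) {
        for (Rule rule : rowsFromDatabase) {
            rulesByCategory.computeIfAbsent(rule.category, k -> new ArrayList<>()).add(rule);
        }
    }

    /** Returns XPath -> replacement for only the PII categories the user selected. */
    Map<String, String> replacementsFor(Set<String> selectedCategories) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<Rule>> entry : rulesByCategory.entrySet()) {
            if (selectedCategories.contains(entry.getKey())) {
                for (Rule rule : entry.getValue()) {
                    result.put(rule.xpath, rule.replacement);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample rows; the "hl7" prefix and the paths are illustrative only.
        DeidRuleCacheSketch cache = new DeidRuleCacheSketch(List.of(
                new Rule("NAME", "//hl7:patient/hl7:name/hl7:given", "REDACTED"),
                new Rule("NAME", "//hl7:patient/hl7:name/hl7:family", "REDACTED"),
                new Rule("PHONE", "//hl7:patientRole/hl7:telecom/@value", "tel:000-000-0000")));
        System.out.println(cache.replacementsFor(Set.of("NAME")).keySet());
    }
}
```

Grouping by category up front means each request only has to look up the categories the user ticked in the UI, without going back to the database.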
Challenges/Learnings
Identifying PII Nodes: Initially, the approach taken was to find PII nodes programmatically, but accurately identifying them in the XML was difficult. PII data can appear in different places within a C-CDA XML, and there were no clear patterns to predict where this data would be.
Solution: Switched to a configuration-driven approach. The tool is loaded with a set of XPaths that point to known PII locations, and it removes or replaces the PII at the nodes those XPaths select.
- Implementation: DOMXmlHelper.java
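As a rough illustration of the XPath-driven idea, here is a minimal sketch using only the JDK's built-in DOM and XPath APIs. It is not the code in DOMXmlHelper.java: the class name `XPathRedactionSketch`, the `redact` method, and the sample XML are assumptions for this example. The sample is deliberately namespace-free to keep the XPaths short; real C-CDA documents are namespaced, which is the next challenge described below.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical sketch: overwrite the nodes selected by each configured XPath. */
final class XPathRedactionSketch {

    /** Replaces the text (or attribute value) of every node matched by each XPath. */
    static String redact(String xml, Map<String, String> replacementsByXPath) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Reject DOCTYPE declarations so uploaded XML cannot trigger XXE attacks.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            for (Map.Entry<String, String> rule : replacementsByXPath.entrySet()) {
                NodeList hits = (NodeList) xpath.evaluate(
                        rule.getKey(), doc, XPathConstants.NODESET);
                for (int i = 0; i < hits.getLength(); i++) {
                    // Works for both element nodes and attribute nodes.
                    hits.item(i).setTextContent(rule.getValue());
                }
            }

            // Serialize the modified DOM back to a string.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not redact document", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<patient><name><given>Jane</given><family>Doe</family></name>"
                + "<telecom value=\"tel:555-0100\"/></patient>";
        System.out.println(redact(xml, Map.of(
                "/patient/name/given", "REDACTED",
                "/patient/name/family", "REDACTED",
                "/patient/telecom/@value", "tel:000-000-0000")));
    }
}
```

Because the nodes are overwritten in place and not deleted, the element structure of the document is left unchanged, which fits the goal of keeping the output conformant.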
Using Multiple Namespaces: When an XML document declares a namespace, XPath evaluation needs a NamespaceContext, which maps each prefix used in the XPath expressions to its namespace URI.
- Further Reading: XPath namespace resolution example
However, C-CDA XML can have multiple namespaces, two of which are shown below.
<ClinicalDocument xmlns="urn:hl7-org:v3" xmlns:sdtc="urn:hl7-org:sdtc">
The challenge was figuring out how to resolve multiple namespaces while reading the XPaths.
Solution: Used a Map that associates each prefix with its namespace URI, and had the NamespaceContext resolve prefixes through that Map.
- Implementation: CCDANamespaceContextResolver.java, DOMXmlHelper.java
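A Map-backed NamespaceContext might look like the sketch below. This is an illustration under assumptions, not the contents of CCDANamespaceContextResolver.java: the class name `MapNamespaceContextSketch`, the `evaluate` helper, and the choice of `hl7` as the prefix for the default namespace are all hypothetical. XPath 1.0 has no notion of a default namespace, so even elements in `urn:hl7-org:v3` need some prefix in the expression.

```java
import java.io.StringReader;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Hypothetical NamespaceContext that resolves any number of prefixes from a Map. */
final class MapNamespaceContextSketch implements NamespaceContext {
    private final Map<String, String> uriByPrefix;

    MapNamespaceContextSketch(Map<String, String> uriByPrefix) {
        this.uriByPrefix = Map.copyOf(uriByPrefix);
    }

    @Override
    public String getNamespaceURI(String prefix) {
        if (prefix == null) {
            throw new IllegalArgumentException("prefix must not be null");
        }
        // Unbound prefixes resolve to the empty namespace URI, per the interface contract.
        return uriByPrefix.getOrDefault(prefix, XMLConstants.NULL_NS_URI);
    }

    @Override
    public String getPrefix(String namespaceURI) {
        for (Map.Entry<String, String> entry : uriByPrefix.entrySet()) {
            if (entry.getValue().equals(namespaceURI)) {
                return entry.getKey();
            }
        }
        return null;
    }

    @Override
    public Iterator<String> getPrefixes(String namespaceURI) {
        String prefix = getPrefix(namespaceURI);
        return (prefix == null ? List.<String>of() : List.of(prefix)).iterator();
    }

    /** Evaluates a prefixed XPath against namespaced XML and returns its string value. */
    static String evaluate(String xml, String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for prefixed XPaths to match
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new MapNamespaceContextSketch(Map.of(
                    "hl7", "urn:hl7-org:v3",
                    "sdtc", "urn:hl7-org:sdtc")));
            return xpath.evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        String xml = "<ClinicalDocument xmlns=\"urn:hl7-org:v3\" xmlns:sdtc=\"urn:hl7-org:sdtc\">"
                + "<recordTarget><patientRole><patient><name><given>Jane</given></name>"
                + "<sdtc:deceasedInd value=\"false\"/></patient></patientRole></recordTarget>"
                + "</ClinicalDocument>";
        System.out.println(evaluate(xml, "//hl7:patient/hl7:name/hl7:given"));
        System.out.println(evaluate(xml, "//sdtc:deceasedInd/@value"));
    }
}
```

Adding support for another namespace then only requires one more Map entry; the XPaths stored in the configuration simply use the agreed prefixes.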
WAR file: The tool was not launching after deploying the WAR file in the Tomcat webapps folder, showing a 404 error.
Solution: Extended the SpringBootServletInitializer class and overrode its configure method. This uses Spring Framework’s Servlet 3.0 support and lets the application be configured when it is launched by the servlet container.
Implementation: DeidToolApplication.java
Further Reading: Spring Boot Documentation
Implementing Azure AD: Encountered authentication issues while implementing Azure AD with Spring Security as illustrated in this article.
Solution: Spring Cloud provides a way to connect to Microsoft Entra ID (formerly Azure AD) services. Implemented this so that each user's de-identified file is stored in that user's own folder.
Implementation: AzureOAuth2LoginSecurityConfig.java, application-azure.properties
Further Reading: Microsoft Azure Documentation
GitHub Repo: https://github.com/richardvemagiri/deid-tool
Scope for Improvement
This tool's implementation is by no means the optimal approach to PII redaction. However, I developed it after noticing that my colleagues needed a way to de-identify C-CDA XML files for internal analysis. This project represents my initial effort in addressing that need.
Here are a few areas where this tool could be enhanced.
XML Processing: This tool uses a DOM Parser to process XML files provided by the user, but other parsers could also be used.
- Further Reading: https://www.baeldung.com/java-xml
PII in Unstructured Text: This tool only redacts PII from specific XML nodes where structured data exists. However, PII data can also be present in XML nodes where unstructured text exists (typically ‘notes’).
E.g.: Mr. Jones, a 32 yr old male, is diagnosed with prostate cancer.
PII in such unstructured text would also need to be redacted for the C-CDA to be fully anonymized.
Alternate Solution: CliniDeID® (a tool from Clinacuity, Inc.) is an impressive solution for automatically de-identifying clinical notes and structured data in accordance with the HIPAA Safe Harbor method. However, during my exploration of the tool, I noticed it occasionally skipped redacting certain PII, such as names, in the C-CDA document. Also, the resulting XML does not maintain conformance to the C-CDA specification. While the tool is robust overall, I wanted a tool where the user can choose the PII data points (PII nodes) and the parts of the document (XPaths) where de-identification is needed, while still maintaining conformance to the C-CDA specification. Hence this approach.
With either tool, manual oversight or additional validation may be necessary to ensure full compliance.
Further Reading: https://clinacuity.com/clinideid/
Closing Thoughts
Although this tool is a preliminary attempt and has its limitations, it provides a practical solution for redacting PII data from C-CDA documents. The tool, leveraging Java and Spring technologies, offers a flexible and configurable approach to de-identification. Future enhancements could focus on improving XML processing performance and addressing PII in unstructured text. Feedback and collaboration are welcome to refine and expand this tool's capabilities, to meet the ever-evolving needs of healthcare data privacy.
Feel free to leave a comment below—I'd love to hear your thoughts!
Happy learning!
References
Deidentification Guidance: https://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification/index.html
Create App on Azure Entra ID (Azure AD): Spring Boot Security With Azure Active Directory | OICD | Oauth2 | JavaTechie
Spring Boot login example using MySQL: https://www.javaguides.net/2018/10/user-registration-module-using-springboot-springmvc-springsecurity-hibernate5-thymeleaf-mysql.html
Spring @Value annotation (use cases): https://www.baeldung.com/spring-value-annotation
Guide to uploading files: https://spring.io/guides/gs/uploading-files
XPath NameSpace Resolution: https://howtodoinjava.com/java/xml/xpath-namespace-resolution-example
Calling jQuery function on form submit: https://stackoverflow.com/questions/8999210/how-do-i-call-a-jquery-function-on-submitting-a-form
Spring & AJAX: https://spring.io/blog/2010/01/25/ajax-simplifications-in-spring-3-0
Rendering Thymeleaf Fragments with AJAX requests: https://stackoverflow.com/questions/20982683/spring-mvc-3-2-thymeleaf-ajax-fragments
XML Libs: https://github.com/eugenp/tutorials/tree/master/xml
jQuery AJAX progress bar:
Written by Richard Vemagiri
