Extracting Actionable Threat Intelligence from Cybersecurity Reports – Our Journey at HACK IITK Global Security Hackathon 2025

Muskan Chalana

Introduction

In an era where cyber threats are evolving at an unprecedented pace, the ability to extract and analyze threat intelligence efficiently is more crucial than ever. At the HACK IITK Global Security Hackathon 2025, my teammates and I set out to build a solution that could transform unstructured cybersecurity reports into structured, actionable intelligence. Our project wasn’t just about extracting data; it was about bridging the gap between raw security intelligence and real-world decision-making.

From 7000+ participants and 4000+ teams, we made it to the top 200 finalists, presenting our solution before a distinguished jury of cybersecurity experts, industry leaders, and AI pioneers. Hosted by C3iHub, IIT Kanpur, and supported by industry leaders like BEL, SBI, AWS, SecureDApp, and Saptang Labs, this global security hackathon served as the perfect platform to push the boundaries of cybersecurity innovation.


The Problem Statement

Organizations and security researchers rely on detailed cybersecurity reports to understand ongoing threats. However, these reports are often lengthy, unstructured, and require manual effort to extract key insights. Our challenge was to develop a system that could:

  1. Extract IoCs (Indicators of Compromise) – Domains, IPs, hashes, and URLs

  2. Identify TTPs (MITRE ATT&CK Techniques) – Recognize attacker behaviors and methods

  3. Detect Threat Actors – Link reports to known adversaries

  4. Retrieve Malware Details – Cross-verify hash data and classify malware

  5. Identify Targeted Entities – Extract affected organizations, industries, and victims


Our Approach – Breaking Down the Methodology

1️⃣ IoC Extraction: Regex, NER, and OSINT Validation

To extract indicators of compromise (IoCs) effectively, we combined:

  • Regular Expressions (Regex) & Named Entity Recognition (NER): Regex efficiently captured structured IoCs (IPs, domains, URLs, hashes), while NER identified textual mentions of artifacts.

  • Hash Identification & Classification: Using HashID, we categorized extracted hashes into MD5, SHA1, SHA256, TLSH, and SSDEEP.

  • Live Domain Validation: Instead of relying on static lists, we dynamically fetched IANA TLD updates to validate extracted domains.

  • OSINT Enrichment: Extracted hashes were cross-referenced with VirusTotal, Hybrid Analysis, and MalwareBazaar, ensuring actionable intelligence rather than mere text extraction.
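The regex, hash typing, and TLD validation steps above can be sketched roughly as follows. This is a minimal illustration, not our exact project code: the patterns are simplified, hash types are guessed from hex length alone (which is only the core idea behind tools like HashID), and the function names are made up for this post. The IANA list URL is the public TLD list; `fetch_tlds` is shown for completeness and needs network access.

```python
import re
import urllib.request

# Simplified patterns for illustration; real reports also need defanged forms ("hxxp", "[.]").
IPV4_RE = re.compile(r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)\b")
URL_RE = re.compile(r"\bhttps?://[^\s\"'<>)]+", re.IGNORECASE)
DOMAIN_RE = re.compile(r"\b(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,63}\b", re.IGNORECASE)
HASH_RE = re.compile(r"\b[a-fA-F0-9]{32,64}\b")

# Hex digest length -> most likely algorithm (an assumption; other algorithms share these lengths).
HASH_TYPES = {32: "MD5", 40: "SHA1", 64: "SHA256"}

IANA_TLD_URL = "https://data.iana.org/TLD/tlds-alpha-by-domain.txt"


def parse_tld_list(text):
    """Parse the IANA list format: one TLD per line, '#' lines are comments."""
    return {
        line.strip().lower()
        for line in text.splitlines()
        if line.strip() and not line.startswith("#")
    }


def fetch_tlds(url=IANA_TLD_URL):
    """Download the current TLD list (requires network access)."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return parse_tld_list(resp.read().decode("ascii"))


def extract_iocs(text, valid_tlds):
    """Return URLs, IPs, typed hashes, and domains whose TLD is in valid_tlds."""
    urls = sorted({u.rstrip(".,;") for u in URL_RE.findall(text)})
    ips = sorted(set(IPV4_RE.findall(text)))
    hashes = {
        h.lower(): HASH_TYPES[len(h)]
        for h in HASH_RE.findall(text)
        if len(h) in HASH_TYPES
    }
    domains = sorted(
        {
            d.lower()
            for d in DOMAIN_RE.findall(text)
            if d.rsplit(".", 1)[-1].lower() in valid_tlds
        }
    )
    return {"urls": urls, "ips": ips, "hashes": hashes, "domains": domains}
```

The TLD check is what keeps file names such as `payload.exe` from being reported as domains, since a bare domain regex cannot tell the two apart.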

2️⃣ TTP Mapping: MITRE ATT&CK API & Semantic Similarity

Understanding attacker behavior is key to proactive defense. We implemented:

  • Live MITRE ATT&CK Integration: Our system fetched the latest TTPs (Tactics, Techniques, and Procedures) dynamically from MITRE’s CTI database.

  • Semantic Similarity with Sentence-BERT (SBERT): Instead of basic keyword matching, we computed similarity scores between report text and MITRE TTP descriptions to improve accuracy.

  • Contextual Technique Extraction: By analyzing sentence structures and attack narratives, we ensured precise mapping of TTPs to corresponding attack stages.
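To make the similarity step concrete, here is a small sketch of ranking techniques by cosine similarity. To keep it runnable without downloading a model, `embed` is a stand-in that uses plain word counts rather than SBERT; in a real pipeline it would be replaced by a sentence-embedding model. The technique IDs are real ATT&CK IDs, but the descriptions are shortened paraphrases written for this example, and the threshold value is an arbitrary assumption.

```python
import math
import re
from collections import Counter

# Paraphrased, shortened descriptions for illustration (not MITRE's official text).
TECHNIQUES = {
    "T1566": "Phishing: adversaries send phishing emails with malicious attachments or links to gain access",
    "T1059": "Command and Scripting Interpreter: adversaries abuse command and script interpreters such as PowerShell to execute commands",
    "T1486": "Data Encrypted for Impact: adversaries encrypt data on target systems to interrupt availability, as in ransomware",
}


def embed(text):
    """Stand-in embedding: bag-of-words counts. Swap in an SBERT encoder for real use."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank_techniques(sentence, techniques, threshold=0.1):
    """Return (technique_id, score) pairs above the threshold, best match first."""
    query = embed(sentence)
    scored = [(tid, cosine(query, embed(desc))) for tid, desc in techniques.items()]
    return sorted(
        [(tid, score) for tid, score in scored if score >= threshold],
        key=lambda pair: pair[1],
        reverse=True,
    )
```

With word counts, only sentences that share vocabulary with a description will match. Dense sentence embeddings are what let a phrase like "lured staff into opening a booby-trapped invoice" land near the phishing technique even with no shared keywords.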

3️⃣ Threat Actor Identification: MITRE Knowledge Graph & NLP Refinement

To correctly attribute cyberattacks to known adversaries:

  • MITRE Threat Actor Dataset: We fetched real-time data on adversaries from MITRE’s Enterprise ATT&CK repository.

  • Cross-Sentence Context Matching: Since attackers aren’t always explicitly mentioned, we applied multi-sentence NLP parsing to infer threat actor presence indirectly.

  • Similarity-Based Attribution: Using Fuzzy Matching, extracted names were mapped to MITRE’s threat actor list, reducing false detections.
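The fuzzy-matching step can be illustrated with Python's standard `difflib` module. This is a sketch under assumptions: the alias table below is a tiny hand-written sample (the real list would come from the ATT&CK group data), and the cutoff of 0.85 is an illustrative value, not a tuned one.

```python
import difflib

# Lowercased name or alias -> canonical group name. A hand-picked sample for illustration.
KNOWN_ACTORS = {
    "apt28": "APT28",
    "fancy bear": "APT28",
    "lazarus group": "Lazarus Group",
    "fin7": "FIN7",
}


def match_actor(candidate, cutoff=0.85):
    """Map an extracted name to a known actor, or return None if nothing is close enough."""
    key = " ".join(candidate.lower().split())  # normalize case and whitespace
    hits = difflib.get_close_matches(key, list(KNOWN_ACTORS), n=1, cutoff=cutoff)
    return KNOWN_ACTORS[hits[0]] if hits else None
```

The cutoff matters more than it might seem: short names such as APT28 and APT29 differ by a single character, so a loose threshold would merge distinct groups, while a strict one still tolerates small variations like "Fancy Bears".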

4️⃣ Malware Intelligence: Hash Classification & OSINT Cross-Verification

Extracting malware details required integrating multiple OSINT repositories:

  • Automated Hash Classification: Using HashID and refined regex patterns, our system classified hash types accurately.

  • Multi-Source Threat Intelligence: Every extracted hash was cross-verified against:

    • VirusTotal – Fetching last analysis results, threat classifications, and antivirus detections.

    • Hybrid Analysis – Retrieving sandbox reports, behavioral analysis, and malware verdicts.

    • MalwareBazaar – Providing first-seen timestamps, YARA rule matches, and malware signatures.

  • Contextual Malware Attribution: If only a malware name (without a hash) was detected in a report, we performed semantic similarity searches to infer potential matching hashes.

  • Extended Malware Metadata: Beyond hash-based detection, we extracted detailed attributes such as file types, file sizes, execution timestamps, embedded resources, and imphash values, enabling a deeper analysis of malware behavior.
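One way to structure the multi-source lookup is to treat each OSINT source as a plain function and merge whatever comes back. The sketch below makes several assumptions: each source is a hypothetical callable that returns a verdict string or `None`, `RateLimited` is a made-up exception standing in for an HTTP 429 response, and no real API client is shown.

```python
import time
from dataclasses import dataclass, field


class RateLimited(Exception):
    """Hypothetical signal that a source asked us to slow down (e.g. HTTP 429)."""


@dataclass
class HashReport:
    sha256: str
    verdicts: dict = field(default_factory=dict)  # source name -> verdict string
    errors: dict = field(default_factory=dict)    # source name -> error message

    @property
    def malicious_votes(self):
        """Number of sources that flagged the hash as malicious."""
        return sum(1 for v in self.verdicts.values() if v == "malicious")


def call_with_backoff(lookup, sha256, retries=3, base_delay=1.0, sleep=time.sleep):
    """Retry a lookup with exponential backoff when the source rate-limits us."""
    for attempt in range(retries + 1):
        try:
            return lookup(sha256)
        except RateLimited:
            if attempt == retries:
                raise
            sleep(base_delay * 2 ** attempt)


def enrich_hash(sha256, sources, sleep=time.sleep):
    """Query every source; one failing source never blocks the others."""
    report = HashReport(sha256)
    for name, lookup in sources.items():
        try:
            verdict = call_with_backoff(lookup, sha256, sleep=sleep)
        except Exception as exc:
            report.errors[name] = f"{type(exc).__name__}: {exc}"
            continue
        if verdict is not None:  # None means the source has no record of this hash
            report.verdicts[name] = verdict
    return report
```

Keeping errors per source, instead of failing the whole lookup, means a report can still be enriched when one service is down or out of quota.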

5️⃣ Targeted Entity Recognition: BERT-Based NLP & Entity Context Filtering

One of the biggest challenges was accurately identifying affected entities (victims, institutions, industries). We tackled this with:

  • Fine-Tuned BERT for Entity Recognition: We used dslim/bert-base-NER to extract organization names, locations, and industry sectors.

  • Context-Aware Filtering with SBERT: Since attackers were sometimes misclassified as victims, we cross-referenced extracted entities with known MITRE adversaries to refine classification.

  • Industry Impact Categorization: We grouped organizations based on attack types, allowing for better insights into sector-specific threats.
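The cross-referencing step that separates victims from attackers can be sketched as a simple post-filter over NER output. The entity dictionaries below follow the shape produced by a Hugging Face token-classification pipeline with grouped entities (`entity_group`, `score`, `word`); the sample entities, the actor list, and the score threshold are assumptions for illustration, and this covers only the name lookup, not the SBERT-based context scoring.

```python
def split_victims_and_actors(entities, known_actors, min_score=0.8):
    """Split ORG entities into likely victims and known threat actors.

    entities: list of dicts with 'entity_group', 'score', and 'word' keys.
    known_actors: iterable of adversary names (e.g. taken from ATT&CK group data).
    """
    actors_lower = {name.lower() for name in known_actors}
    victims, actors = [], []
    for ent in entities:
        # Keep only confident organization mentions.
        if ent["entity_group"] != "ORG" or ent["score"] < min_score:
            continue
        name = ent["word"].strip()
        if name.lower() in actors_lower:
            actors.append(name)
        else:
            victims.append(name)
    return victims, actors
```

An exact lookup like this misses aliases and misspellings, which is where fuzzy or embedding-based matching earns its keep.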


Challenges & Key Learnings

Challenges We Faced

Building this solution was far from easy. Some of our biggest challenges included:

✔️ Data Accuracy & False Positives: IoCs extracted through regex often contained false indicators (e.g., internal IPs or harmless domains). Semantic similarity scoring helped refine accuracy.

✔️ Real-Time OSINT Integration: Ensuring seamless API calls to OSINT sources like VirusTotal, Hybrid Analysis, and MalwareBazaar while handling rate limits was a challenge.

✔️ Extracting Indirect Mentions: Many reports imply rather than explicitly state targeted organizations or threat actors. Our multi-sentence parsing approach improved detection.

✔️ Performance Optimization: Balancing speed and accuracy in NLP-based extraction was crucial, especially for large cybersecurity reports.
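As an illustration of the false-positive problem, a cheap rule-based pre-filter can remove the most obvious noise before any similarity scoring runs. This sketch is a complement to the semantic approach described above, not a replacement for it: it uses the standard `ipaddress` module to drop non-routable addresses, and the domain allowlist is a hypothetical input that an analyst would have to curate.

```python
import ipaddress


def is_reportable_ip(value):
    """True only for syntactically valid, globally routable IP addresses."""
    try:
        ip = ipaddress.ip_address(value)
    except ValueError:
        return False  # not an IP at all (e.g. a version string like 999.1.1.1)
    # is_global is False for private, loopback, link-local, and documentation ranges.
    return ip.is_global


def filter_domains(domains, allowlist):
    """Drop domains that equal, or are subdomains of, an allowlisted domain."""
    allowed = {a.lower() for a in allowlist}
    kept = []
    for domain in domains:
        d = domain.lower()
        if any(d == a or d.endswith("." + a) for a in allowed):
            continue
        kept.append(domain)
    return kept
```

Note that an allowlist can hide real abuse of legitimate services, so filtered items are better logged than silently discarded.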

Key Takeaways & Future Improvements

🚀 Context-aware NLP is essential: Extracting targeted entities required multi-layered filtering beyond traditional NER models.

🚀 Regex isn’t enough: We learned that regex-based IoC extraction, while effective, needs ML-based refinement for higher precision.

🚀 Better OSINT Data Fusion: Merging results from multiple OSINT sources provided richer insights, but further confidence scoring would enhance accuracy.

🚀 Expanding TTP Detection: Future work could involve automated fine-tuning of ATT&CK mappings using custom-trained language models.


Final Thoughts – More Than Just a Hackathon

HACK IITK 2025 was more than just a competition for us; it was a chance to test our skills in a real-world cybersecurity challenge. Hosted by C3iHub, IIT Kanpur, this hackathon emphasized the growing importance of cybersecurity research and AI-driven solutions. The adrenaline rush of seeing our solution work in real time, the sleepless nights debugging edge cases, and the moment we stood in the finals before a jury of national cybersecurity leaders: this was an experience we won’t forget.

We walked away with more than just a project. We left with a renewed passion for cybersecurity, a deeper understanding of AI-driven threat intelligence, and a drive to take this knowledge further. The insights we gained here will undoubtedly shape our future contributions to the field.


🔗 Explore More

We’ve made our GitHub repository publicly available so that the cybersecurity community can build on our work. Check it out here: [MuskanChalana/Threat-Intelligence-Extractor]

Stay curious. Keep hacking. Build for impact. 🚀
