AWS crawler unable to detect a recently added partition

learner

Why is my crawler unable to recognize the newly added partition?

I recently added a new partition to my dataset in a specific directory, but my data crawler (a custom job configured in PyCharm with Apache Spark integration) fails to detect this new partition when I run it. The crawler has worked fine in the past and continues to recognize the existing partitions without any issues. However, the newly added partition does not appear in the processed data or the logs when the crawler runs.

Here’s a breakdown of the steps I’ve taken and relevant information:

1. Current Setup:

Tooling: I’m using a custom crawler developed in Python with Spark, running in PyCharm. The crawler is responsible for scanning a directory structure and reading partitioned data (using Hive-style partitioning).

Data Storage: My dataset is stored in Amazon S3. Each partition corresponds to a subdirectory organized by date, e.g., /data/partition_date=2024-09-28/.

Partition Scheme: The partitioning is done based on a specific column (e.g., partition_date). This has been working fine for all previous partitions.
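For context, here is a minimal sketch of the Hive-style naming convention the crawler relies on. The helper below is illustrative only, not the actual crawler code:

```python
import re

# Illustrative helper (not the actual crawler): map a Hive-style directory
# name like "partition_date=2024-09-28" to a (column, value) pair.
PARTITION_RE = re.compile(r"^(?P<key>[^=/]+)=(?P<value>[^/]+)$")

def parse_partition_dir(dirname):
    match = PARTITION_RE.match(dirname)
    if match is None:
        return None  # not a partition directory; a crawler would skip it
    return match.group("key"), match.group("value")

print(parse_partition_dir("partition_date=2024-09-28"))
# ('partition_date', '2024-09-28')
```

Anything that does not match this `column=value` pattern (e.g. `_SUCCESS` marker files) would be ignored by such a scan.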

2. What I Did Recently:

• I added a new directory for a recent date, for example: /data/partition_date=2024-09-28/.

• I verified that the new partition contains the correct data and follows the same structure as previous partitions.

• The folder and file permissions on S3 seem to be correctly set and mirror those of older partitions.

• When I manually check the directory via S3, the new partition is visible, and I can access its contents.
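One way I double-check that step is to filter the object keys returned by an S3 listing (e.g. from boto3's list_objects_v2 paginator) against the new partition's prefix. The keys below are a hypothetical listing result mirroring the layout described above:

```python
# Sketch: given object keys returned by an S3 listing (e.g. via boto3's
# list_objects_v2 paginator), confirm the new partition's files are present.
def keys_under_prefix(keys, prefix):
    return [k for k in keys if k.startswith(prefix)]

# Hypothetical listing result mirroring the layout described above.
listed_keys = [
    "data/partition_date=2024-09-27/part-00000.parquet",
    "data/partition_date=2024-09-28/part-00000.parquet",
]
print(keys_under_prefix(listed_keys, "data/partition_date=2024-09-28/"))
# ['data/partition_date=2024-09-28/part-00000.parquet']
```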

3. The Problem:

• When I run my crawler, it does not seem to detect this new partition. There are no errors or exceptions related to file access, but the crawler does not process the new data in the recently added partition.

• The logs from the crawler indicate that it scanned and processed the older partitions but skipped over the new partition, as if it doesn’t exist.

• Other partitions from earlier dates continue to be detected and processed as expected.

4. What I’ve Tried:

Re-running the Crawler: I restarted the entire process multiple times, thinking it might have missed the partition during a single scan.

Refreshing Metadata: I updated the metadata in Spark to reflect the latest directory structure and ran MSCK REPAIR TABLE to ensure all partitions are recognized, but the new partition still doesn’t appear.

Manual Check: I used Spark’s SHOW PARTITIONS command to list all available partitions, and the new one is missing from the results.

Logs: I added additional logging to the crawler to print out the directories and partitions it scans, but the new partition never shows up in the logs.

5. Possible Hypotheses:

• There may be a mismatch between how the new partition is named or structured and what the crawler expects. However, I’ve double-checked the naming and structure, and it appears consistent with the existing partitions.

• There could be an issue with S3 permissions or visibility, but I’ve manually verified that the permissions match those of the other partitions.

• The crawler may be caching partition information and failing to update it with the new partition. I haven’t yet identified how to clear or refresh any such cache in Spark.
