Ultimate Guide to Understanding Spring Batch 5 Processing

Akash R Chandran
20 min read

At some point in your development life, you will probably need batch processing, and you will want a framework that is robust, resilient, and widely adopted, with long-term support (LTS) for stable production use. Spring Batch is one such framework. As the name suggests, it is part of the larger Spring ecosystem. Spring is a Java-based framework with many projects and a big community of contributors. If you have used or learned Spring Boot, you’re already halfway to learning Spring Batch. This blog explains the core concepts of Spring Batch, and then we will implement a simple batch process that sends emails to a list of addresses. Without wasting time, let’s get straight into it.

What is Batch Processing?

Batch processing is handling a large amount of data in batches. Batches can be time ranges or numbers of records, etc. We use batch processing when the amount of data is finite and known and can be processed with some delay. As per the Spring documentation, batch processing is the processing of a finite amount of data in a manner that does not require external interaction or interruption.

For example, suppose you open a Google Form to collect user feedback for your mobile application, and then download the responses as a CSV file. Now you want to load the data from the CSV file into a database table. Here, you need batch processing to use resources efficiently and to avoid errors or omissions. Since you're processing a large amount of data, it will take time to complete.

Why choose batch processing, one might ask? The main advantage of batch processing is that it handles a finite amount of data, which can be very large, even reaching millions, billions, or trillions of records. Depending on the requirements, you can design an approach that uses resources efficiently.

What is Spring Batch?

Spring Batch is a framework that facilitates batch processing by providing various features such as parallel processing, fault tolerance, horizontal scalability, and ease of use. It was initially developed in 2007 through a collaboration between Accenture and Interface21 (later SpringSource), the creators of the Spring Framework. At the end of March 2008, the Spring Batch 1.0.0 release was made available to the public. It was one of the early frameworks for batch processing in Java.

Now, Spring Batch is being developed by VMware, the same company that develops Spring Boot. The latest version, Spring Batch 5, introduces many new features. Development is progressing rapidly, with new releases every six months. Spring Batch provides various features, making it one of the best batch-processing frameworks out there. Let’s look at some of the features.

Features of Spring Batch

  • Flexibility: Spring Batch provides excellent flexibility. Whatever batch processing you need to do, Spring Batch can support it. Run batches in parallel on a single device or distribute them across multiple devices; Spring Batch supports it all.

  • Fault Tolerance: Spring Batch has various fault tolerance mechanisms, such as retry, skip, and restartability, which are essential for a batch processing framework.

  • Transaction Management: Your job needs transactions in a database. Worry not, Spring Batch can automatically start transactions. If it fails at any point, it will automatically roll back, so you don’t have to worry about it.

  • Chunk-based processing: Spring Batch supports chunk processing, meaning it doesn’t commit or save until a certain number of records or items are processed. This can significantly improve performance when storing data in a database or file.

  • Job Management: Spring Batch can handle batch jobs efficiently. It stores information such as the number of items processed, the current item number, etc., in its database. In case of a crash, it can recover and continue from where it left off.

  • Support for various resources like databases, files and more: Since it builds on the Spring ecosystem, which already supports many databases out of the box, Spring Batch inherits that capability. On top of that, it also supports many custom databases, files, and other resources.
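
To make the chunk-based processing feature from the list above more concrete, here is a minimal plain-Java sketch. It does not use the Spring Batch API; the class name ChunkSketch and the method processInChunks are hypothetical names chosen for illustration. The idea is simply that items are buffered and "committed" one chunk at a time instead of one item at a time.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration of chunk-oriented processing; not the Spring Batch API.
final class ChunkSketch {

    // Buffers items and "commits" them chunkSize at a time.
    // Returns the committed chunks so the caller can see how many writes happened.
    static <T> List<List<T>> processInChunks(List<T> items, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<List<T>> committed = new ArrayList<>();
        List<T> buffer = new ArrayList<>(chunkSize);
        for (T item : items) {
            buffer.add(item);                   // read and process one item
            if (buffer.size() == chunkSize) {   // chunk is full: write it in one go
                committed.add(List.copyOf(buffer));
                buffer.clear();
            }
        }
        if (!buffer.isEmpty()) {                // flush the final, partial chunk
            committed.add(List.copyOf(buffer));
        }
        return committed;
    }

    public static void main(String[] args) {
        List<Integer> records = List.of(1, 2, 3, 4, 5, 6, 7);
        System.out.println(processInChunks(records, 3)); // prints [[1, 2, 3], [4, 5, 6], [7]]
    }
}
```

With seven records and a chunk size of three, there are three writes instead of seven, which is where the performance benefit comes from when each write is a database or file operation.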


Java isn’t your cup of tea? Why not use Python for batch processing with Luigi? It's a Python-based framework for workflow orchestration developed by Spotify. Get concepts of it in my blog link below.

Architecture of Spring Batch

The Spring Batch architecture contains three high-level components: the Application, the Batch Core, and the Infrastructure.

The application component contains the custom code and configuration used to build your batch processes. Business logic services and all the configurations of how you organize the job are considered the application.

The Batch Core contains the core runtime classes necessary to launch and control a batch job. It includes implementations for JobLauncher, Job, and Step. Both Application and Core are built on top of a common infrastructure.

The infrastructure contains all the readers, writers, and different services. The core also uses these elements. The application layer is connected to the infrastructure layer through the custom reader and writer implementations that the developer creates.

Core Components of Spring Batch

Job

A Job is a single execution unit that summarises a series of processes for a batch application in Spring Batch. Multiple jobs can exist in a single application. A Job is an entity that comprises the details of the entire batch process. From where to start to what comes next, every detail is contained within a Job entity. A job can contain any number of steps.

A batch job in Spring Batch is represented by the Job interface provided by the spring-batch-core dependency:

public interface Job {
    String getName();
    void execute(JobExecution execution);
}

At a fundamental level, the Job interface requires that implementations specify the Job name (the getName() method) and what the Job is supposed to do (the execute method).

The execute method gives a reference to a JobExecution object. The JobExecution represents the actual execution of the Job at runtime. It contains several runtime details, such as the start time, the end time, the execution status, and so on.

In Spring Batch, a job is defined as a Spring bean, allowing Spring Batch to manage everything from creation to termination. The definition looks something like this.

@Bean
public Job newsletterJob(JobRepository jobRepository, Step step1) {
    return new JobBuilder("newsletter-job", jobRepository)
            .start(step1)
            .build();
}

Step

A Step is a domain object that encapsulates an independent, sequential phase of a batch Job. It contains all of the information necessary to define a unit of work in a batch Job. A job can contain any number of steps. Each step can be individually parallelised, fault-tolerant and restartable.

A Step in the Spring Batch is represented by the Step interface provided by the spring-batch-core dependency:

public interface Step {
  String getName();
  void execute(StepExecution stepExecution) throws JobInterruptedException;
}

Similar to the Job interface, the Step interface requires, at a fundamental level, an implementation to specify the step name (the getName() method) and what the step is supposed to do (the execute method).

The execute method provides a reference to a StepExecution object. The StepExecution represents the actual execution of the step at runtime. It contains several runtime details, such as the start time, the end time, the execution status, and so on. Spring Batch stores this runtime information in the metadata repository, similar to the JobExecution, as we have seen previously.

JobParameters

Job parameters are values passed to the job, allowing the entire job process to access them. One of the key features of Spring Batch is its ability to determine whether a run is a new job or a re-run of a previous one, and this is determined using the values of the job parameters passed. The values passed as job parameters are hashed and stored. When a job is launched, its job parameters are compared against the stored ones to check whether they match an earlier job.

 JobParameters jobParameters = new JobParametersBuilder()
                .addString("filePath", "input/data.csv")
                .addLong("time", System.currentTimeMillis()) // make the job unique every time it runs
                .toJobParameters();

The job parameters can be accessed from both the JobExecution and the StepExecution.

JobLauncher

Now that we have created a job with steps, we need some way to invoke it. This can be done using the JobLauncher interface.

public interface JobLauncher {
   JobExecution run(Job job, JobParameters jobParameters)
          throws
             JobExecutionAlreadyRunningException,
             JobRestartException,
             JobInstanceAlreadyCompleteException,
             JobParametersInvalidException;
}

The run method is designed to launch a given Job with a set of JobParameters.

JobExecutionContext

It stores the different details about the job execution that Spring Batch uses to keep track of the different parts of the job execution. The job execution context is a key-value store, represented in code by the ExecutionContext class, where Spring Batch puts some keys, and it is available to the entire job process. Custom data can also be added and accessed later in the job execution. It can be used to share data between steps.

JobExecution jobExecution = jobLauncher.run(job, jobParameters);
// The job-level context is a plain ExecutionContext obtained from the JobExecution
ExecutionContext jobExecutionContext = jobExecution.getExecutionContext();
jobExecutionContext.put("jobStartTime", new Date());

StepExecutionContext

It is the same as the JobExecutionContext but local to a single step execution, so each step has its own step execution context. It is also a key-value store in which Spring Batch keeps keys related to the step execution. Custom data can also be added and accessed later in the step execution. It can be used to share data between different parts of the same step.

StepExecution stepExecution = jobExecution.getStepExecutions().iterator().next();
// The step-level context is also an ExecutionContext, obtained from the StepExecution
ExecutionContext stepExecutionContext = stepExecution.getExecutionContext();
stepExecutionContext.put("currentItem", currentItem);
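
The difference in scope between the two contexts can be sketched in plain Java. This is only an illustration using maps, not the Spring Batch API; ContextScopeSketch and runTwoSteps are hypothetical names. In Spring Batch itself, a step would reach the job-level context through its StepExecution, via getJobExecution().getExecutionContext().

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java stand-in for the two context scopes; not the Spring Batch API.
final class ContextScopeSketch {

    // Runs two pretend steps and returns the keys the second step can read.
    static Map<String, Object> runTwoSteps() {
        Map<String, Object> jobContext = new HashMap<>();    // one per job execution

        // Step 1: writes a step-local value and publishes one value job-wide.
        Map<String, Object> step1Context = new HashMap<>();  // one per step execution
        step1Context.put("currentItem", "a@example.com");
        jobContext.put("recipientCount", 10);

        // Step 2: starts with its own empty step context, but still sees the job context.
        Map<String, Object> step2Context = new HashMap<>();
        Map<String, Object> visibleToStep2 = new HashMap<>(jobContext);
        visibleToStep2.putAll(step2Context);
        return visibleToStep2;
    }

    public static void main(String[] args) {
        System.out.println(runTwoSteps()); // prints {recipientCount=10}
    }
}
```

The value stored in the first step's own context never reaches the second step, while the value placed in the job-level context does.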

JobRepository

Spring Batch is stateful, which means that it stores details of the job execution and the step execution. These are represented by the JobExecution and StepExecution. Spring Batch supports different types of databases for storing its state: relational databases such as MySQL and PostgreSQL, and non-relational databases such as MongoDB. The relations between different tables can be seen below.

Each table is crucial for Spring Batch to manage running job states. Each of them serves as metadata storage for different components of the Spring Batch environment. Let’s look into each table.

Job_Instance

The Job_Instance table stores the unique job runs. As mentioned earlier, the job parameters are used to uniquely identify a job, and this is the table used for that purpose. A job can be run again if something fails and you want to restart the old job. The job parameters are hashed and stored as a job key in this table. The job name and the job key together identify whether the job is new or not. For each unique run of the job, Spring Batch creates a new job instance.

Job_Execution

The Job_Execution table stores the details of each job run. Even if a job is restarted, it is stored as a new row, and the details are kept separate. The Job_Execution table consists of various columns that store details about the job execution, such as start_time, end_time, and status. The job execution, along with the job instance, is checked to see if the job is complete or failed. If it has failed, it can be restarted. Completed jobs cannot be restarted by default; you will need to manually configure that.
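
The restart rules described above can be summarised as a small decision function. The following is a plain-Java sketch, not Spring Batch code; RestartSketch, Status, and Decision are hypothetical names. The two rejection outcomes correspond loosely to the JobInstanceAlreadyCompleteException and JobExecutionAlreadyRunningException declared on JobLauncher.run.

```java
import java.util.List;

// Plain-Java illustration of the restart rules; not the Spring Batch API.
final class RestartSketch {

    enum Status { STARTED, COMPLETED, FAILED }

    enum Decision { NEW_EXECUTION, RESTART, REJECT_ALREADY_RUNNING, REJECT_ALREADY_COMPLETE }

    // previousExecutions: statuses of earlier executions of the same job instance.
    static Decision decide(List<Status> previousExecutions) {
        if (previousExecutions.isEmpty()) {
            return Decision.NEW_EXECUTION;            // brand-new job instance
        }
        if (previousExecutions.contains(Status.STARTED)) {
            return Decision.REJECT_ALREADY_RUNNING;   // an execution is still in progress
        }
        if (previousExecutions.contains(Status.COMPLETED)) {
            return Decision.REJECT_ALREADY_COMPLETE;  // completed instances are not re-run by default
        }
        return Decision.RESTART;                      // only failed executions so far
    }

    public static void main(String[] args) {
        System.out.println(decide(List.of()));                               // prints NEW_EXECUTION
        System.out.println(decide(List.of(Status.FAILED)));                  // prints RESTART
        System.out.println(decide(List.of(Status.FAILED, Status.COMPLETED))); // prints REJECT_ALREADY_COMPLETE
    }
}
```

Each RESTART or NEW_EXECUTION decision would result in a new row in the Job_Execution table, while the Job_Instance row stays the same across restarts.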

Job_Execution_Params

The Job_Execution_Params table stores the job parameters. There can be multiple job parameters for a single job. Sometimes you want to pass a value as a job parameter but don’t want that value to uniquely identify the job. In that case, you can set the parameter's identifying flag to false, and it will be reflected in this table.
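
To see how the identifying flag interacts with the hashed job key, here is a plain-Java sketch. It is an assumption-laden illustration, not Spring Batch's actual key generator: the canonical "name=value;" string and the choice of SHA-256 are picked for the example, and JobKeySketch, Param, and jobKey are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of deriving a job key; not Spring Batch's own algorithm.
final class JobKeySketch {

    // A parameter value plus the flag that says whether it takes part in job identity.
    record Param(String value, boolean identifying) {}

    // Builds a stable key from the identifying parameters only.
    static String jobKey(Map<String, Param> params) {
        StringBuilder canonical = new StringBuilder();
        // TreeMap sorts by name so the key does not depend on insertion order.
        new TreeMap<>(params).forEach((name, param) -> {
            if (param.identifying()) {
                canonical.append(name).append('=').append(param.value()).append(';');
            }
        });
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(canonical.toString().getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(hash);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is available on every standard JVM
        }
    }

    public static void main(String[] args) {
        Map<String, Param> first = Map.of(
                "filePath", new Param("input/data.csv", true),
                "requestedBy", new Param("alice", false));
        Map<String, Param> second = Map.of(
                "filePath", new Param("input/data.csv", true),
                "requestedBy", new Param("bob", false));
        // Only the non-identifying parameter differs, so both runs map to the same key.
        System.out.println(jobKey(first).equals(jobKey(second))); // prints true
    }
}
```

If the filePath value changed instead, the key would change and the run would be treated as a new job instance.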

Job_Execution_Context

Used to store the JobExecutionContext data. The stored data is in Java serialized format, which is then encoded in base64. The base64 encoding is used to prevent character issues that can occur during storage.

Step_Execution

The Step_Execution table stores the details about the execution of different steps. Each step execution is stored as a new row. It stores details recorded by the step, including start_time, end_time, read_count, and write_count.

Step_Execution_Context

It stores the StepExecutionContext data. The stored data is in Java serialized format, which is then encoded in base64. The base64 encoding is used to prevent character issues that can occur during storage.
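
The storage format described for both context tables, Java serialization followed by Base64 encoding, can be reproduced with the JDK alone. The sketch below is an illustration of that format, not Spring Batch's serializer class; ContextSerializationSketch and its two methods are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;

// Plain-Java illustration of "Java serialization, then Base64"; not the Spring Batch API.
final class ContextSerializationSketch {

    // Serializes the context map and encodes the bytes as Base64 text.
    static String serialize(Map<String, Object> context) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(new HashMap<>(context)); // copy into a Serializable HashMap
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return Base64.getEncoder().encodeToString(bytes.toByteArray());
    }

    // Decodes the Base64 text and deserializes it back into a map.
    @SuppressWarnings("unchecked")
    static Map<String, Object> deserialize(String encoded) {
        byte[] raw = Base64.getDecoder().decode(encoded);
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(raw))) {
            return (Map<String, Object>) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String stored = serialize(Map.of("jobStartTime", "2024-01-01T00:00:00", "recipientCount", 10));
        System.out.println(stored.startsWith("rO0AB"));               // prints true
        System.out.println(deserialize(stored).get("recipientCount")); // prints 10
    }
}
```

The "rO0AB" prefix is the Base64 form of the Java serialization stream header, so a column value starting with it is a quick hint that you are looking at serialized context data. Because the format is Java serialization, anything you put in the context needs to be Serializable.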

Types of Steps

There are many types of steps available, and each of them is useful in different scenarios. The most used step type is the Tasklet step. We will look into each of them.

TaskletStep

The Tasklet step is a straightforward step that calls the Tasklet's execute method. The Tasklet is a functional interface with an execute method that the developer must implement. Spring Batch will automatically call the Tasklet's execute method repeatedly until it returns a finished status. By default, Spring Batch runs the Tasklet step inside a transaction, ensuring data integrity and consistency. The Tasklet allows us to implement everything from reading to writing, including fault tolerance mechanisms like retry and skip.
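
The "call execute repeatedly until it returns a finished status" behaviour can be sketched in plain Java. Status and Work below are hypothetical stand-ins for Spring Batch's RepeatStatus (which has CONTINUABLE and FINISHED values) and Tasklet; the loop is a simplification that leaves out transactions and metadata updates.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Plain-Java illustration of the repeat-until-finished contract; not the Spring Batch API.
final class RepeatSketch {

    enum Status { CONTINUABLE, FINISHED }  // hypothetical stand-in for RepeatStatus

    @FunctionalInterface
    interface Work {                        // hypothetical stand-in for Tasklet
        Status execute();
    }

    // Calls the work repeatedly until it reports FINISHED; returns the number of calls.
    static int runUntilFinished(Work work) {
        int calls = 0;
        Status status;
        do {
            status = work.execute();
            calls++;
        } while (status == Status.CONTINUABLE);
        return calls;
    }

    public static void main(String[] args) {
        Deque<String> queue = new ArrayDeque<>(
                List.of("a@example.com", "b@example.com", "c@example.com"));
        int calls = runUntilFinished(() -> {
            System.out.println("sending to " + queue.poll());
            // Ask to be called again while there is still work left.
            return queue.isEmpty() ? Status.FINISHED : Status.CONTINUABLE;
        });
        System.out.println("calls = " + calls); // prints calls = 3
    }
}
```

A tasklet that does all its work in one call simply returns the finished status the first time, which is what the newsletter example later in this post does.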

The best-known implementation of the Tasklet is the chunk-oriented Tasklet, which reads, processes, and writes the data in chunks, which is very neat when it comes to processing huge amounts of data. Spring Batch's built-in retry and skip support is provided for the chunk-oriented step, and several of its parallel processing mechanisms are built around it as well. We will look into this step in detail in future blogs.
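
As a rough idea of what retry and skip mean for item processing, here is a plain-Java sketch. It is not how Spring Batch implements fault tolerance; RetrySkipSketch, process, maxAttempts, and skipLimit are hypothetical names, and the writer is just a Consumer that may throw.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Plain-Java illustration of retry and skip semantics; not the Spring Batch API.
final class RetrySkipSketch {

    record Result(List<String> written, List<String> skipped) {}

    // Tries each item up to maxAttempts times. Items that still fail are skipped
    // until skipLimit is used up, at which point the whole run fails.
    static Result process(List<String> items, Consumer<String> writer, int maxAttempts, int skipLimit) {
        List<String> written = new ArrayList<>();
        List<String> skipped = new ArrayList<>();
        for (String item : items) {
            boolean ok = false;
            for (int attempt = 1; attempt <= maxAttempts && !ok; attempt++) {
                try {
                    writer.accept(item);
                    ok = true;
                } catch (RuntimeException e) {
                    // retry; if every attempt fails, skip handling below takes over
                }
            }
            if (ok) {
                written.add(item);
            } else if (skipped.size() < skipLimit) {
                skipped.add(item);
            } else {
                throw new IllegalStateException("Skip limit " + skipLimit + " exceeded at item " + item);
            }
        }
        return new Result(written, skipped);
    }

    public static void main(String[] args) {
        int[] flakyCalls = {0};
        Result result = process(
                List.of("a@example.com", "flaky@example.com", "bad@example.com"),
                to -> {
                    if (to.startsWith("bad")) throw new IllegalStateException("rejected");
                    // fails on the first attempt only, so a retry succeeds
                    if (to.startsWith("flaky") && flakyCalls[0]++ == 0) throw new IllegalStateException("timeout");
                },
                3, 1);
        System.out.println(result.written()); // prints [a@example.com, flaky@example.com]
        System.out.println(result.skipped()); // prints [bad@example.com]
    }
}
```

Retry helps with transient failures, skip keeps one bad record from failing the whole run, and the skip limit stops a job that is failing on too many records.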

Partition Step

The partitioning step divides data into different parts based on specific criteria. It's like cutting a pizza into slices, where each slice is a partition. The criteria are defined by a Partitioner, and a PartitionHandler then distributes the resulting partitions to workers. Partitioning is done according to the criteria you provide. Out of the box, the Spring Batch core offers a single PartitionHandler implementation, TaskExecutorPartitionHandler, which runs the partitions on local threads.
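
A common partitioning criterion is an ID range, and the slicing itself is simple arithmetic. The following plain-Java sketch shows one way to split a range into contiguous slices; RangePartitionSketch and Range are hypothetical names. In Spring Batch, the Partitioner interface's partition(int gridSize) method plays this role and returns a map from partition name to ExecutionContext, where a sketch like this would store the range bounds.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java illustration of range partitioning; not the Spring Batch API.
final class RangePartitionSketch {

    record Range(long minId, long maxId) {}

    // Splits the inclusive id range [min, max] into at most gridSize contiguous slices.
    static Map<String, Range> partition(long min, long max, int gridSize) {
        if (gridSize <= 0 || max < min) {
            throw new IllegalArgumentException("gridSize must be positive and max >= min");
        }
        long total = max - min + 1;
        long sliceSize = (total + gridSize - 1) / gridSize; // ceiling division
        Map<String, Range> partitions = new LinkedHashMap<>();
        long start = min;
        for (int i = 0; i < gridSize && start <= max; i++) {
            long end = Math.min(start + sliceSize - 1, max);
            partitions.put("partition" + i, new Range(start, end));
            start = end + 1;
        }
        return partitions;
    }

    public static void main(String[] args) {
        // Ten records split three ways: ids 1-4, 5-8 and 9-10.
        partition(1, 10, 3).forEach((name, range) ->
                System.out.println(name + " -> " + range.minId() + ".." + range.maxId()));
    }
}
```

Each slice can then be handed to a separate worker, which reads only the records whose IDs fall inside its range.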

Flow Step

The flow step is used to define the order in which steps should be executed. It allows for conditional execution of steps. For example, if the first step is completed successfully, the next step can be executed. If the first step fails, you can skip the second step and execute the third one instead. This can be managed using the flow step.

@Bean
public Job conditionalJob(JobRepository jobRepository) {
    // Spring Batch 5 style: JobBuilder replaces the deprecated JobBuilderFactory
    return new JobBuilder("conditionalJob", jobRepository)
        .start(step1())
        .on("COMPLETED")
        .to(step2()) // If step1 succeeds, go to step2
        .from(step1())
        .on("FAILED")
        .to(step3()) // If step1 fails, go to step3
        .end() // End the job
        .build();
}

JobStep

It is a step that calls another job to execute. You can have multiple job definitions, and one job can have a step that calls another job. We will look into it in future blogs; for now, this is all you need to know.

@Bean
public Step childStep() {
    return new StepBuilder("childStep", jobRepository)
        .tasklet(
            (contribution, chunkContext)
                -> {
                System.out.println("Executing Child Job Step");
                return RepeatStatus.FINISHED;
            },
            transactionManager)
        .build();
}

@Bean
public Job childJob() {
    return new JobBuilder("childJob", jobRepository)
                         .start(childStep())
                         .build();
}

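// Parent Job: Includes a JobStep to execute the child job
@Bean
public Step parentStep(Job childJob, JobLauncher jobLauncher) {
    JobStep jobStep = new JobStep();
    jobStep.setName("parentStepExecutingChildJob");
    jobStep.setJob(childJob);                // the job this step delegates to
    jobStep.setJobLauncher(jobLauncher);     // used to launch the child job
    jobStep.setJobRepository(jobRepository); // every step needs the repository
    return jobStep;
}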

@Bean
public Job parentJob(Step parentStep) {
    return new JobBuilder("parentJob", jobRepository)
                          .start(parentStep)
                          .build();
}

Basic Job Flow

The JobLauncher will start the job. Before it begins, a new row will be created in the Job_Instance table if it is a new job, and the job parameters will be added to the Job_Execution_Params table. Then, the Job execution will start, and a new row will be added to the Job_Execution table. The job doesn't need to be unique; for every job run, a new row will be added to the Job_Execution table. The job will call the Step, and a new row will be added to the Step_Execution table. Then, the step will execute the tasklet, and so on.

Launching a Job

There are several ways to launch a job, and we'll explore all of them in this section. Each method is used in different scenarios based on your purpose.

Automatic Job Launch on Startup

If we have only one job definition, we can configure Spring Batch to launch it on startup. However, if there are multiple job definitions, this approach can cause issues, so it's better not to use it on its own. To enable this, simply set the spring.batch.job.enabled property to true in application.properties or application.yml.

spring:
  batch:
    job:
      enabled: true

If you have multiple job definitions then you can provide the name of the job you want to run on startup in application.properties or application.yml as below.

spring:
  batch:
    job:
      enabled: true
      name: newsletter-job

Using CommandLineJobRunner

It’s a good option if you want to run the job from the command line. For this approach to work, you need to know the class that holds the job configuration and the name of the job, because both are passed to the CommandLineJobRunner.

$ java CommandLineJobRunner in.akashrchandran.nbt.config.BatchConfig newsletter-job

Using JobLauncher

Sometimes we need to run a job on demand, for example by scheduling it with Quartz or triggering a run when an API request arrives. In those cases we can’t rely on running at startup or from the command line; the solution is to use the JobLauncher interface. You can autowire it using Spring's dependency injection.

jobLauncher.run(job, new JobParameters());

You can use the JobLauncher anywhere and run the job based on your needs. I will show an example of running the job from a web container.

@Controller
public class JobLauncherController {

    @Autowired
    JobLauncher jobLauncher;

    @Autowired
    Job job;

    @PostMapping("/jobLauncher")
    public void handle() throws Exception {
        jobLauncher.run(job, new JobParameters());
    }
}

Code Example

We won't stop with just the basics. Let's try to apply what we've learned and create a simple application that sends newsletters. There will be a text file containing some random email IDs. Since the application is simple, we will implement it using the Tasklet we discussed earlier.

I will be using IntelliJ Ultimate IDE to code this project. However, you can use any IDE you like, such as VS Code, Vim, or even Notepad. It’s your choice.

Getting the dependencies

Head over to the Spring Initializr to get the dependencies for our application. We will require the dependencies I have selected below.

Let me clarify the dependencies needed. We require Spring Batch, of course. The Java Mail Sender is for sending emails, and H2 is for the in-memory database because I don't want to set up an external database for this simple project. I don't recommend using H2 in production. You should provide a database for the JobRepository. It can be almost any database provider, like MySQL, PostgreSQL, or even MongoDB.

Download the zip file and load it into your IDE, and then we can start coding.

Creating a BatchConfig class

We'll start by creating a class BatchConfig to store all the configurations for our batch process, including the Job and Step definitions. You don't need to keep the same name; it’s just your choice. You have to annotate the class with @Configuration to let Spring know that the class contains the configurations for our batch process.

@Configuration
public class BatchConfig {
// Batch config
}

  1. Defining the Job

From what we've learned, we know we need to define the entire batch process as a Job. The job definition is simple since we will only have one step, which is sending the emails.

    @Bean
    public Job newsletterJob(JobRepository jobRepository, Step step) {
        return new JobBuilder("newsletter-job", jobRepository)
                .start(step)
                .build();
    }

Since it is a Spring bean, the parameters will be autowired for us, that's the magic of Spring. I won't explain that here, so let's move on to the next part.

  2. Defining the Step

Next, we need to define the step and pass the tasklet, which we will write shortly, to execute.

    @Bean
    public Step newsletterStep(JobRepository jobRepository,
                               PlatformTransactionManager transactionManager,
                               Tasklet tasklet) {
        return new StepBuilder("newsletter-step", jobRepository)
                .tasklet(tasklet, transactionManager)
                .build();
    }

As we have only a single tasklet we are going to define, we can autowire that as well.

You can view the entire class code on GitHub, here is the link.

Configuration for Sending Emails

Java Mail Sender needs some setup before you can send emails, like your email address and password. I'll be using Gmail to send my emails, but you can use any other service.

For Gmail, you need to get an App Password. You can't use your regular Google account password for security reasons, so you will need to generate one. I'll provide a link to an article on how to obtain the App Password.

Now we have to add some properties to application.properties or application.yml, whichever you use. I prefer the YAML file, so I will show only that; you can use the same property keys in application.properties and it should work.

spring:
  application:
    name: newsletter-batch-tasklet-async
  mail:
    properties:
      mail:
        smtp:
          auth: true
          starttls:
            enable: true
    host: smtp.gmail.com
    port: 587
    username: ${SPRING_MAIL_USERNAME}
    password: ${SPRING_MAIL_PASSWORD}

The ${SPRING_MAIL_USERNAME} and ${SPRING_MAIL_PASSWORD} placeholders are used here to get the values from environment variables. How you set the environment variables depends on your operating system.

# unix or linux based system
$ export SPRING_MAIL_USERNAME=example@example.com
$ export SPRING_MAIL_PASSWORD=your-app-password

# windows command prompt
> set SPRING_MAIL_USERNAME=example@example.com
> set SPRING_MAIL_PASSWORD=your-app-password

# windows powershell
> $env:SPRING_MAIL_USERNAME = "example@example.com"
> $env:SPRING_MAIL_PASSWORD = "your-app-password"

If you're using the IntelliJ IDE, you can set the environment variables in the Run Configurations.
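
If you want to confirm that both variables are visible to the JVM before starting the application, a small standalone check like the following can help. It is only a convenience sketch and not part of the project; MailEnvCheck and missing are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Standalone helper to verify the mail credentials are set; not part of the project code.
final class MailEnvCheck {

    static final List<String> REQUIRED = List.of("SPRING_MAIL_USERNAME", "SPRING_MAIL_PASSWORD");

    // Returns the names of required variables that are absent or blank in the given environment.
    static List<String> missing(Map<String, String> env) {
        List<String> missing = new ArrayList<>();
        for (String name : REQUIRED) {
            String value = env.get(name);
            if (value == null || value.isBlank()) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        List<String> missing = missing(System.getenv());
        System.out.println(missing.isEmpty()
                ? "Mail credentials found."
                : "Missing environment variables: " + missing);
    }
}
```

Run it from the same terminal or run configuration you use for the application, since environment variables are specific to the process that launches the JVM.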

Creating the Tasklet

To create a Tasklet step, we need to implement the Tasklet interface provided by Spring Batch. First, we will create a class named NewsletterTasklet and make it implement the Tasklet interface. After that, you will likely see an error indicating that you need to implement the methods of that interface, specifically the execute() method.

We need to annotate the class with @Component to inform Spring that it is a Spring bean. Additionally, we will inject JavaMailSender to send the mail; the @RequiredArgsConstructor and @Slf4j annotations come from Lombok and generate the constructor and the logger for us. The recipientsFilePath field is the path of the file that contains the emails. We use the @Value annotation so that we can define the values in application.properties or application.yml.

@Component
@RequiredArgsConstructor
@Slf4j
public class NewsletterTasklet implements Tasklet {
    @Value("${newsletter.recipients.file}")
    private String recipientsFilePath;

    @Value("${newsletter.max.concurrent:10}")
    private int maxConcurrent;

    private final JavaMailSender mailSender;

    @Override
    public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) throws Exception {
        // custom code to send the newsletter
        return RepeatStatus.FINISHED; // placeholder so the skeleton compiles
    }
}

  1. Reading emails from a text file

We will be reading the emails from a text file named spam.txt, which I have already placed in the resources folder. So we need a method that reads the text file line by line and returns the emails as a list. We can define a private method for this.

    private List<String> readRecipientsFromFile() {
        try {
            Resource resource = new ClassPathResource(recipientsFilePath);
            if (resource.exists()) {
                try (Stream<String> lines = new BufferedReader(
                        new InputStreamReader(resource.getInputStream(), StandardCharsets.UTF_8))
                        .lines()) {
                    log.info("Successfully loaded recipients from resources: {}", recipientsFilePath);
                    return lines.filter(line -> line != null && !line.isBlank())
                            .toList();
                }
            } else {
                log.warn("Resource file '{}' does not exist", recipientsFilePath);
            }
        } catch (Exception e) {
            log.error("Error reading recipients from file '{}': {}", recipientsFilePath, e.getMessage(), e);
        }
        return Collections.emptyList();
    }

  2. Sending an email to the email ID

Then we need a method that sends an email to the address passed to it. We can create another private method for that as well.

    private void sendEmail(String to) {
        SimpleMailMessage message = new SimpleMailMessage();
        message.setFrom("yourEmail@domain.com");
        message.setTo(to);
        message.setSubject("Newsletter");
        message.setText("This is the content of the newsletter.");
        mailSender.send(message);
        log.info("Email sent successfully to {}", to);
    }

  3. Implementing Asynchronous Email Sending

Finally, we need to code the execute method, which will use the methods above to read the recipients and send the emails. We use an ExecutorService to call the send email method asynchronously. We could use many threads to send the messages, but it is always better to limit the number of threads that execute concurrently. We can do something like this:

@Override
public RepeatStatus execute(
    StepContribution contribution, ChunkContext chunkContext) throws Exception {
    log.info("Starting NewsletterTasklet execution with async processing.");
    List<String> recipients = readRecipientsFromFile();
    int totalRecipients = recipients.size();
    ExecutorService executor = Executors.newFixedThreadPool(maxConcurrent);
    List<CompletableFuture<Void>> allFutures = new ArrayList<>();
    // Create async tasks for all recipients
    for (String recipient : recipients) {
        CompletableFuture<Void> future = CompletableFuture.runAsync(() -> {
            try {
                sendEmail(recipient);
            } catch (Exception e) {
                log.error("Failed to send email to {}: {}", recipient,
                    e.getMessage(), e);
            }
        }, executor);
        allFutures.add(future);
    }
    CompletableFuture.allOf(allFutures.toArray(new CompletableFuture[0])).join();
    executor.shutdown();
    try {
        if (!executor.awaitTermination(30, TimeUnit.SECONDS)) {
            executor.shutdownNow();
        }
    } catch (InterruptedException e) {
        executor.shutdownNow();
        Thread.currentThread().interrupt();
    }

    // Failures are logged above, so report recipients processed rather than emails sent
    log.info("Finished NewsletterTasklet execution. Total recipients processed: {}", totalRecipients);
    return RepeatStatus.FINISHED;
}
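
Because failed sends are caught and logged inside each task, the final count in the method above is the number of recipients rather than the number of emails that actually went out. If you want separate counts, one possible approach is sketched below in plain Java. SendCountSketch, Summary, and sendAll are hypothetical names, and the sender is passed in as a Consumer so the sketch runs without a mail server.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

// Plain-Java sketch of bounded concurrent sending with success/failure counts.
final class SendCountSketch {

    record Summary(int sent, int failed) {}

    // Runs sender for every recipient on at most maxConcurrent threads and counts outcomes.
    static Summary sendAll(List<String> recipients, Consumer<String> sender, int maxConcurrent) {
        AtomicInteger sent = new AtomicInteger();
        AtomicInteger failed = new AtomicInteger();
        ExecutorService executor = Executors.newFixedThreadPool(maxConcurrent);
        try {
            CompletableFuture<?>[] futures = recipients.stream()
                    .map(recipient -> CompletableFuture.runAsync(() -> {
                        try {
                            sender.accept(recipient);
                            sent.incrementAndGet();
                        } catch (RuntimeException e) {
                            failed.incrementAndGet(); // count the failure instead of losing it
                        }
                    }, executor))
                    .toArray(CompletableFuture[]::new);
            CompletableFuture.allOf(futures).join(); // wait for every task to finish
        } finally {
            executor.shutdown(); // always release the threads, even if join() throws
        }
        return new Summary(sent.get(), failed.get());
    }

    public static void main(String[] args) {
        Summary summary = sendAll(
                List.of("a@example.com", "bad@example.com", "b@example.com"),
                to -> {
                    if (to.startsWith("bad")) throw new IllegalStateException("rejected");
                },
                2);
        System.out.println(summary); // prints Summary[sent=2, failed=1]
    }
}
```

AtomicInteger is used because several threads update the counters at the same time, and the finally block makes sure the thread pool is shut down whatever happens.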

Executing the batch process

I have listed about 10 emails in my spam.txt file. In a real scenario, you would read these from a user database, but this example is simpler. Before executing the batch process we need to add some more properties to the application.yml.

I will show the job running on application startup. You can use any of the approaches from the Launching a Job section to run it.

spring:
  batch:
    jdbc:
      initialize-schema: always # to create job repository tables.
    job:
      enabled: true # to run the job on application startup.

newsletter:
  recipients:
    file: spam.txt # file name which contains the emails.
  max:
    concurrent: 5 # limiting the number of concurrent threads.

Now we can simply run the application. Ensure you have added the environment variables; otherwise, the application will display an error.

As you can see, the emails were sent successfully within 17 seconds. I could have increased the concurrent thread count and might have gotten better results, but that is not what this blog is about.

You can view the entire code on my GitHub repository.
