Writing batches is a very common need but the way we do it changed over time. From Spring-Batch to the big data solutions without forgetting JBatch specification there are a lot of framework out there to write a batch. However, rare are the ones you can rely which stay simple, light and easy to integrate in any environment. Let's see how Yupiik Batch brings a relevant alternative to these options.

What is a Batch?

To understand where Yupiik Batch comes from, we must step back and review what we call a batch.

In IT, we commonly differentiate two main application types:

  • Long running applications: applications which start and are not supposed to stop. It is the case of web application where the server is supposed to serve any request any time or daemon applications which are running all the time (for example the network manager on your computer).

  • Task oriented applications: it concerns applications starting and stopping once their task is done. It concerns the simplest command line interface (date command for example which just prints the current date) to the longest batch which can take hours but once it completed its task, it exits.

Indeed, when we speak about batche, we speak about the second category. What will make this "task" oriented application a batch is the philosophy of the development which is generally to handle a huge data volume - but not yet a big data one - and try to optimize the processing of these data.

Concretely, and to illustrate that last statement, if we insert rows in a table, we'll insert them with a bulk logic and not one by one.

What are batch challenges

So writing a batch is respecting a philosophy more than a framework, so why are there so framework out there?

It is because batches bring some constraints and the interesting thing is that it is not in the development itself but in the monitoring/administration of the batches.

Monitoring

When you have a web application, you monitor it with its logs and some endpoints (health check, metrics, ...). For a batch you have logs and that is it. If you want to check what happent it can be tricky and you will need to do some advanced querying on your logs.

While it stays a common requirement, batch frameworks also often bring a way to track what is executed and store that (often in a database) to then expose an UI on these data and let you review what was done.

The simplest solution is a row per execution of batch (a.k.a. job execution) and generally it is linked to a list of step executions which are all the steps the batch did. For example "read data from the web service" or "write data to this database".

This is exactly what Yupiik-Batch will bring with as minimal requirement as possible.

Configuration

Another common need of batch application is to be configurable.

Indeed there are plenty of solutions there but you must pick a solution which is flexible enough to:

  • Be injectable in the application (i.e. be configurable from the environment and/or the system properties of the java command),

  • Be documentable since the first users of the configuration will be the ops guys,

  • (optionally) Be easy to integrate. One example of that is to easily generate a form from your configuration to re-run a batch for example if needed.

Yupiik Batch

Yupiik Batch design is to be simple and as close as possible to plain Java code.

There is no constraints on the way you write the batch itself but the recommended way is to:

  • define your configuration using simple-configuration which consists in marking a Java class fields with @Param:

    public class MyConf {
        @Param(description = "Driver to use", required = true)
        private String driver;
    
        @Param(description = "JDBC URL to use", required = true)
        private String url;
    
        @Param(description = "Database username.")
        private String username;
    
        @Param(description = "Database password.")
        private String password;
    }
  • define a batch implementing Batch<YourConf>

    public class MyBatch implements Batch<MyConf> {
        @Override
        public void accept(final MyConf configuration) {
            // ...
        }
    }
  • implement the batch in accept method - where the configuration instance is considered initialized and injected. Here it is recommended to use the Batch DSL:

    from()
        .map("step1", new MyStep1())
        .map("step2", new MyStep2())
        .run(runConfiguration);
  • launch the batch using Batch.run which will automatically populate a configuration instance from the environment and call accept method of your batch:

    public static void main(final String... args) {
        Batch.run(MyBatch.class, args);
    }

Yupiik Batch DSL

The Yupiik Batch DSL is very simple and close to Stream API since it exposes map, filter, then (equivalent to accept in Stream API).

It always takes as first parameter a name (of the step) and second parameter a functional interface matching the method name (Function, Consumer, Predicate).

This interface has CommentifiableX - for example CommentifiableFunction - flavors which enable to attach to the step a comment which is tracked as well (so when you check what was executed you get the step name, status, duration + a custom comment enabling an easier understanding of the execution and potentially failure reason).

Yupiik Batch Base Components

A very important feature of using simple base API like Function is to ease to write components and combine them in higher level components (HLC). It enables to create higher level features but also to fully control the reporting (steps) level you want.

TIP
keep in mind the reporting is the way you communicate between the dev and ops "layers" of the project.

Yupiik Batch decided to not provide trivial components so you will not find a generic http client data reader for example - this, generally, requires too much toggles to be usable and a custom one is simpler to maintain.

However, some components are provided encapsulating common logic. You can find them listed on Yupiik Batch Components documentation.

Without entering into the details here are some very interesting highlights:

  • DatasetDiffComputer which computes a Diff between two sorted Iterator of data. It is very convenient and efficient to synchronize two datasets (a REST API with a database or two databases for example). It comes with its DiffExecutor companion which enables to execute a Diff and apply it on an output storage (you provide the implementation of the actual storage so it can be a REST API or plain SQL implementations).

  • Mapper enables to use an annotation driven mapping between two types (records):

    @Mapping(
        from = IncomingModel.class,
        to = OutputModel.class,
        documentation = "Converts an input to an output.",
        properties = {
            @Property(type = CONSTANT, to = "outputValue", value = "something"),
            @Property(type = TABLE_MAPPING, from = "inputKeyField", to = "mappedOutput", value = "myLookupTable", onMissedTableLookup = FORWARD)
        },
        tables = {
            @MappingTable(
                    name = "myLookupTable",
                    entries = {
                            @Entry(input = "A", output = "1"),
                            @Entry(input = "C", output = "3")
                    }
            )
        })
    public class MyMapperSpec {
        @Custom(description = "Maps X to Y prefixing it with `foo`.")
        String outputField(final IncomingModel in) {
            return "foo" + in.getX();
        }
    }

    The key feature there is to fully describe the mapping between two records statically. This enables to use MapperDocGenerator class to generate the documentation of the mapping and let documentation readers to know what is done in the mapping step (ops for example).

  • ExecutionTracer which, once configured in the RunConfiguration passed to the run method of the Yupiik Batch DSL, enables to store the job and step executions.

Yupiik Batch UI

This UI reads jobs and steps from a database. It is generally the UI associated to the ExecutionTracer tracker.

Here is what it can look like:

Yupiik Batch Executions

Then clicking on a batch identifier you can see the steps the batch executed:

Yupiik Batch Executions
TIP

the small + buttons enables to see the comment associated to the line it is shown on.

The last note to mention about the UI is that it has js extensions. I will not enter into the details in this post but high level it will enable you to:

  • Add routes to the UI - so add features like grabbing logs, metrics, etc...

  • Reformat/rewrite the rows so you can rewrite comments to insert direct links to some other application or data,

  • Enable you to directly access the data the batch works on,

  • Etc...

Going further

We saw that Yupiik Batch provides:

  • A simple deployment friendly configuration solution well integrated with the batch runtime,

  • A simple to deploy and wire UI to monitor your batches,

  • A simple and very flexible programming model Java friendly to write any batch application.

This is what is code focused, but if we step back we have to consider a few more points:

  • It integrates very well in Kubernetes through Job or CronJob and the configuration make it very smooth to use with a dedicated ConfigMap (highly recommended).

  • It is very simple to test since it is a plain standalone application - even if you can use Spring, CDI or Guice if you need an IoC, so JUnit 5 tests are trivial to write:
    @Test
    void test() {
        new MyBatch().accept(new MyConfig());
        assertItDidWhatWasNeeded();
    }
  • For a complete E2E deployment you can integrate it very easily with Yupiik Logging which will enable you to have a JSON logging and therefore make the Kubernetes log aggregation way more efficient. The tip there is to extend the default JSON formatter to inject a batch identifier which will enable you to filter logs per execution too (and potentially wire it in the comment automatically).

  • It is easy to integrate which enables to generate and deploy a static documentation on Github/Gitlab/FTP pages very easily.

Conclusion

Yupiik Batch was used to deploy hundreds of batches and with some good successes. Batch UI enabled us to monitor the batches very efficiently and the programming model enabled us to integrate in any environment very easily. The very nice point is that the integration in Kubernetes is super smooth and it avoids a lot of other code or infrastructure, even for cron jobs which are now supported natively by Kubernetes/OpenShift.

So stay simple is likely what applies more than ever within the cloud erea and it is exactly what Yupiik Batch was built for.

From the same author:

In the same category: