PMML can be effectively used for integration with Megaladata, allowing you to apply, read, and transform models. Let's consider the process of preparing the system and installing all the necessary dependencies to work with this format.

PMML (Predictive Model Markup Language) is a standard format designed for exchanging and storing predictive models created using various machine learning and statistical methods. It allows developers and analysts to easily exchange models between different systems and tools that support PMML.

Key features of PMML

XML Format: PMML is an XML document, which makes it readable and easy to process using various software tools and programming languages.
Wide range of models: The format supports many types of models, including linear regression, decision trees, neural networks, clustering, association rules, and other machine learning algorithms.
Platform-independent: PMML models can be used in a variety of systems and environments, regardless of where or how they were created.
Ease of integration: PMML makes it easy to integrate predictive models into applications and services, allowing models to be used without having to retrain them or rewrite code.
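
Because a PMML file is plain XML, its basic contents can be inspected with any XML parser, even before a dedicated PMML library is installed. The following is a minimal sketch that uses only the Python standard library. The embedded document is a hand-written, heavily trimmed fragment for illustration (not a complete model), and the helper names local_name and describe_pmml are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written, trimmed PMML fragment used only for illustration;
# a real file exported by a modeling tool contains a full model body.
SAMPLE_PMML = """<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <Header description="Illustrative example"/>
  <DataDictionary numberOfFields="3">
    <DataField name="age" optype="continuous" dataType="double"/>
    <DataField name="income" optype="continuous" dataType="double"/>
    <DataField name="approved" optype="categorical" dataType="string"/>
  </DataDictionary>
  <TreeModel modelName="demo_tree" functionName="classification"/>
</PMML>
"""

# Top-level elements that are not models.
NON_MODEL_ELEMENTS = {
    "Header", "MiningBuildTask", "DataDictionary",
    "TransformationDictionary", "Extension",
}


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def describe_pmml(xml_text):
    """Return the PMML version, declared data fields, and model elements."""
    root = ET.fromstring(xml_text)
    fields = []
    models = []
    for child in root:
        name = local_name(child.tag)
        if name == "DataDictionary":
            fields = [
                (field.get("name"), field.get("dataType"))
                for field in child
                if local_name(field.tag) == "DataField"
            ]
        elif name not in NON_MODEL_ELEMENTS:
            models.append(name)
    return {"version": root.get("version"), "fields": fields, "models": models}


info = describe_pmml(SAMPLE_PMML)
print(info["version"], info["models"], info["fields"])
```

A check like this can help confirm which model element a file contains before you try to load it.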

Popularity of PMML

The PMML format has been around since 1997 and was developed by the Data Mining Group (DMG), an independent consortium that creates standards for data mining.

Today, the PMML format is supported by more than 30 major analytical system vendors, including IBM SPSS Modeler, SAP Predictive Analytics, SAS Enterprise Miner, Microsoft Azure Machine Learning, Microsoft SQL Server 2008 Analysis Services, Oracle Data Mining, Google Cloud AutoML, Amazon SageMaker, RapidMiner, KNIME, Statistica, and Apache Spark.

Many programming languages also have libraries for working with PMML. In particular:

Python (pypmml, sklearn2pmml)
R (the packages pmml, pmml2, and rpart allow you to export models created in R to PMML)
Java (JPMML and JPMML-Evaluator)
C/C++ (cPMML)

Because PMML is supported across these languages, a model built in one of them (for example, Python) can be exported and then used in a C++ or Java program.

Available models and transformations

The PMML format supports the following ML models:

Anomaly detection models
Association rules
Clustering models
Regression models
k-nearest neighbors
Naive Bayes
Neural network
RuleSet
Scorecard
Tree Models
Support Vector Machines
Multiple models (model composition, ensembles, and segmentation)

The PMML format supports the following transformations:

Normalization
Sampling
Value mapping
Text indexing
Functions

Embedding PMML format in Megaladata

What is this for?

If you already have a trained model, you can transfer it to Megaladata by exporting it to PMML format, provided that the model type is supported (see the specification). Using PMML eliminates the need to retrain the model, which saves significant resources.

In addition, PMML is a generally accepted standard: it describes models rigorously and precisely, and it helps ensure that different tools interpret the same model in the same way.

How does it work?

Models in PMML format are easily integrated into Megaladata's processing workflows and can operate as regular nodes. Also, it's not difficult to create a universal derived component for their integration.

To work with the PMML format in Megaladata, we use the Python component, which enables employing additional libraries, including those for working with PMML.

Here we will consider the implementation of Python nodes for two use cases:

Application of a PMML model in Megaladata.
Reading field parameters in PMML format in Megaladata.

Limitations and assumptions

This functionality was tested on Megaladata version 6.5.5 and higher. It can also be used on earlier versions, but they lack the option to run the interpreter in a separate process, so problems with unloading libraries may arise when processing completes. To work around this limitation, you can move all interaction with the PMML format into a separate Python subprocess.
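
The subprocess workaround could look roughly like the sketch below. It is an illustration under several assumptions, not a tested Megaladata recipe: the JSON request layout, the names PMML_WORKER_SOURCE and run_in_subprocess, and the idea of passing a single record as a dict are all made up for this example. It also assumes that the child's predict() result can be serialized to JSON and that nothing else is written to the child's standard output.

```python
import json
import subprocess
import sys

# Source of the child script. This is the only place where pypmml (and with it
# the Java gateway) is loaded, so the parent interpreter stays clean.
# Assumption: the request layout {"model_path": ..., "record": {...}} is
# invented for this sketch.
PMML_WORKER_SOURCE = """
import json, sys
from pypmml import Model

request = json.load(sys.stdin)
model = Model.load(request["model_path"])
result = model.predict(request["record"])  # one record as a dict of inputs
json.dump(result, sys.stdout)
"""


def run_in_subprocess(worker_source, request, python_exe=sys.executable, timeout=300):
    """Run worker_source in a fresh interpreter, exchanging JSON via stdin/stdout."""
    completed = subprocess.run(
        [python_exe, "-c", worker_source],
        input=json.dumps(request),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        timeout=timeout,
    )
    if completed.returncode != 0:
        raise RuntimeError("PMML worker failed: " + completed.stderr.strip())
    return json.loads(completed.stdout)
```

In a node, the call might be run_in_subprocess(PMML_WORKER_SOURCE, {"model_path": link, "record": {...}}). When Python is embedded in another application, sys.executable may not point to a Python interpreter, so python_exe may need to be set explicitly.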

The examples here were run on Windows. They should also work on Linux and similar systems, but the steps for preparing the system and installing dependencies will differ.

ML models created on the basis of Megaladata's data mining components do not currently support exporting into PMML format. From here on, only reading and applying models created in other systems is considered.

The program code is for demonstration purposes only. For use in a production environment, we recommend adding error handling as well as input data preparation and validation.

Preparing the system and installing dependencies

For correct operation, the following conditions must be met:

Python version 3.5 or higher
Java version 8 or higher, but below 16

You don't need a Java Development Kit (JDK) to run the examples; a Java Runtime Environment (JRE) is enough. It's not necessary to use the official Oracle package: forks and alternative builds that meet the version requirements above will also work.

It is important that the paths to Python and Java in the operating system are set up correctly. These paths are usually set up automatically during software installation, but exceptions may occur. To make sure that the paths are configured correctly and meet the requirements, simply enter the following lines in the command line terminal:

python --version
java -version

If Python and Java are installed and the paths are set correctly, information about the current versions will be displayed. Otherwise, an error message will appear.
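
The same check can be scripted, which is convenient when several machines need to be verified. The sketch below is one possible approach using only the standard library; the function names are hypothetical. It relies on java -version, which Java 8 also accepts and which prints its result to the error stream.

```python
import re
import subprocess
import sys


def parse_java_major(version_output):
    """Extract the major Java version from `java -version` output.

    Handles the legacy scheme ("1.8.0_281" means Java 8) and the modern
    one ("11.0.2" means Java 11). Returns None if nothing matches.
    """
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    if first == 1 and match.group(2) is not None:
        return int(match.group(2))
    return first


def java_is_supported(major):
    """Java 8 or higher, but below 16, as required above."""
    return major is not None and 8 <= major < 16


def python_is_supported(version_info=sys.version_info):
    """Python 3.5 or higher, as required above."""
    return tuple(version_info[:2]) >= (3, 5)


def installed_java_major(java_cmd="java"):
    """Run `java -version` and parse it; None if Java is not on the path."""
    try:
        completed = subprocess.run(
            [java_cmd, "-version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # the version banner goes to stderr
            universal_newlines=True,
        )
    except FileNotFoundError:
        return None
    return parse_java_major(completed.stdout)
```

For example, java_is_supported(installed_java_major()) returns False both when Java is missing and when its version falls outside the supported range.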

You also need to install the pypmml library for Python.

If the system has access to the Internet, you can use the command in the command line terminal:

pip install pypmml

If there is no Internet access, you will need to install the specified library manually by downloading the package from the PyPI repository on another machine.
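
One common way to do the manual installation is to download the packages on a connected machine with pip download and then install them offline with pip install --no-index --find-links. The sketch below only builds the two commands; the function names and the wheelhouse directory are hypothetical. It assumes that both machines use a compatible Python version and platform.

```python
import sys


def pip_download_cmd(packages, dest_dir):
    """Command for the machine WITH Internet access.

    Downloads the packages and their dependencies into dest_dir.
    """
    return [sys.executable, "-m", "pip", "download", "--dest", dest_dir] + list(packages)


def pip_offline_install_cmd(packages, source_dir):
    """Command for the machine WITHOUT Internet access.

    Installs only from the files previously copied into source_dir.
    """
    return [
        sys.executable, "-m", "pip", "install",
        "--no-index", "--find-links", source_dir,
    ] + list(packages)


# "wheelhouse" is a hypothetical directory name used for illustration.
print(" ".join(pip_download_cmd(["pypmml"], "wheelhouse")))
print(" ".join(pip_offline_install_cmd(["pypmml"], "wheelhouse")))
```

Either command can be run with subprocess.check_call(...) or typed into the terminal as printed, after copying the directory between the machines.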

You will also need to install the Py4J package, which provides an API between Python and Java. The command in the terminal is:

pip install py4j

If you don't have a downloaded PMML model, you can use the official repository Datasets for PMML Sample Files for testing. This resource contains ready-made datasets for most models in PMML format, ideal for testing.

Case 1: Applying the PMML model in Megaladata

To apply the model, simply feed the necessary data to the input of the Python node (Predicted_model):

Workflow for working with PMML

Typically, the input data is a table with the parameter values for which a prediction is needed:

Table of parameters

For the Predicted_model node, the model_path input variable must contain the absolute path to the PMML file:

Setting up the path to the PMML model

The Python node should have the following settings:

Allow creating output columns in script
Start in separate process

Setting permissions

In the code input field, write the following:

# Connecting the main libraries
import builtin_data, os
from builtin_data import InputTable, InputTables, InputVariables, OutputTable, DataType, DataKind, UsageType
from builtin_pandas_utils import to_data_frame, prepare_compatible_table, fill_table
import numpy as np, pandas as pd
from pypmml import Model

# Getting the path from the variable port
link = InputVariables.Items['model_path'].Value

# Checking if the path is correct
if not os.path.exists(link):
    raise FileNotFoundError(f"PMML file not found at {link}")

# Loading PMML model
model = Model.load(link)

# Reading the input data to which you want to apply the PMML model 
input_frame = to_data_frame(InputTable)

# Applying the PMML model to the input data set
y = model.predict(input_frame)

# Adding the predicted values to the input dataset and generating the output table
output_frame = pd.concat([input_frame, y], axis=1)
if isinstance(OutputTable, builtin_data.ConfigurableOutputTableClass):
    prepare_compatible_table(OutputTable, output_frame, with_index=False)
fill_table(OutputTable, output_frame, with_index=False)

Here is how it looks in the application:

Code in Megaladata

After running the nodes, we will receive an output table with the results of applying the model.

Case 2: Reading field parameters in PMML format in Megaladata

Often there is a need to prepare data before processing and convert it to the required format. Let's explore how to configure another Python node, which will read the model's metadata.

Megaladata node for reading PMML model fields

To implement the functionality, you only need to configure the input variable port and provide the absolute path to the file.

The code will differ slightly from the previous version:

# Connecting the main libraries
import builtin_data, os
from builtin_data import InputTable, InputTables, InputVariables, OutputTable, DataType, DataKind, UsageType
from builtin_pandas_utils import to_data_frame, prepare_compatible_table, fill_table
import numpy as np, pandas as pd
from pypmml import Model

# Getting the path from the variable port
link = InputVariables.Items['model_path'].Value

# Checking if the path is correct
if not os.path.exists(link):
    raise FileNotFoundError(f"PMML file not found at {link}")

# Loading the PMML model
model = Model.load(link)

# Collecting the model metadata into a one-row output table
output_frame = pd.DataFrame({'input_fields': str(model.inputNames), 'output_fields': str(model.outputNames), 'model_name': str(model.modelName), 'pmml_version': str(model.version), 'model_element': str(model.modelElement)}, index=[0])
# If the "Allow creating output columns in script" option is enabled, the structure of the output table can be derived from the DataFrame
if isinstance(OutputTable, builtin_data.ConfigurableOutputTableClass):
    prepare_compatible_table(OutputTable, output_frame, with_index=False)
fill_table(OutputTable, output_frame, with_index=False)

Python code to read PMML model fields

After running the nodes, we will receive an output table with the model's metadata:

Output table

In this case, the input fields (input_fields) and output fields (output_fields) are each written as a single value containing the whole list of names. This can be changed if necessary, depending on the purpose of this auxiliary operation. Knowing which parameters the PMML model expects allows you to focus on preparing the data using other Megaladata components.
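
As an illustration of both points, the sketch below reshapes the field names into one row per field and checks whether a table provides every column the model expects. It is plain Python with hypothetical helper names, and it assumes that model.inputNames and model.outputNames behave like lists of strings.

```python
def fields_to_rows(input_names, output_names):
    """Turn two lists of field names into one record per field."""
    rows = [{"field": name, "role": "input"} for name in input_names]
    rows += [{"field": name, "role": "output"} for name in output_names]
    return rows


def missing_columns(required, available):
    """Return the required names that are absent from the available columns."""
    available = set(available)
    return [name for name in required if name not in available]


# Hypothetical field names standing in for model.inputNames / model.outputNames.
input_names = ["age", "income"]
output_names = ["predicted_approved"]

rows = fields_to_rows(input_names, output_names)
absent = missing_columns(input_names, ["age", "city"])
print(rows)
print(absent)
```

Inside a node, rows could be passed to pd.DataFrame(rows) in place of the one-row table, and the available columns could come from input_frame.columns, so that a missing field is reported before predict() is called.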
