Converting Machine Learning Models to SAS using m2cgen (Python)

Task 1: Convert XGBoost model to VBA

# import packages
import pandas as pd
import numpy as np
import os
import re

from sklearn import datasets
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import m2cgen as m2c

# import data
iris = datasets.load_iris()
X = iris.data
Y = iris.target

First of all, we import the packages and data needed for this task.

# split data into train and test sets
seed = 2020
test_size = 0.3
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)

# fit model on training data
model = XGBClassifier()
model.fit(X_train, y_train)

Then, let’s train a simple XGBoost model.

code = m2c.export_to_visual_basic(model, function_name = 'pred')

Next, convert the XGBoost model to VBA. The export_to_visual_basic function of m2cgen returns your trained XGBoost model as VBA code. Converting to the other languages that m2cgen supports is just as simple.

Photo by cyda

Here comes the core of this tutorial. After converting the model to VBA, a few more steps are needed to turn the VBA code into a SAS script: removing lines that have no use in the SAS environment, such as “Module xxx”, “Function yyy” and “Dim var Z As Double”, and appending “;” to the end of each statement to follow the syntax rules in SAS.

# remove unnecessary things
code = re.sub('Dim var.* As Double', '', code)
code = re.sub('End If', '', code)

# change the script to sas scripts
# change the beginning
code = re.sub(r'Module Model\nFunction pred\(ByRef inputVector\(\) As Double\) As Double\(\)\n',
              'DATA pred_result;\nSET dataset_name;', code)

# change the ending
code = re.sub(r'End Function\nEnd Module\n', 'RUN;', code)

# insert ';'
all_match_list = re.findall(r'[0-9]+\n', code)
for idx in range(len(all_match_list)):
    original_str = all_match_list[idx]
    new_str = all_match_list[idx][:-1]+';\n'
    code = code.replace(original_str, new_str)
all_match_list = re.findall(r'\)\n', code)
for idx in range(len(all_match_list)):
    original_str = all_match_list[idx]
    new_str = all_match_list[idx][:-1]+';\n'
    code = code.replace(original_str, new_str)

# replace the 'inputVector' with var name
dictionary = {'inputVector(0)':'sepal_length',
              'inputVector(1)':'sepal_width',
              'inputVector(2)':'petal_length',
              'inputVector(3)':'petal_width'}
for key in dictionary.keys():
    code = code.replace(key, dictionary[key])

# change the prediction labels
code = re.sub('Math.Exp', 'Exp', code)
code = re.sub(r'pred = .*\n', '', code)
temp_var_list = re.findall(r"var[0-9]+\(\d\)", code)
for var_idx in range(len(temp_var_list)):
    # re.escape keeps the parentheses in e.g. 'var51(0)' from being read as a regex group
    code = re.sub(re.escape(temp_var_list[var_idx]), iris.target_names[var_idx]+'_prob', code)

Step-by-step Explanation:

# remove unnecessary things
code = re.sub('Dim var.* As Double', '', code)
code = re.sub('End If', '', code)

# change the beginning
code = re.sub(r'Module Model\nFunction pred\(ByRef inputVector\(\) As Double\) As Double\(\)\n',
              'DATA pred_result;\nSET dataset_name;', code)

# change the ending
code = re.sub(r'End Function\nEnd Module\n', 'RUN;', code)

The first three parts are straightforward. We use regex to strip the unwanted lines, then change the beginning of the script to “DATA pred_result;\nSET dataset_name;”, where pred_result is the name of the output table produced by the SAS script and dataset_name is the name of the input table we want to score. The last part changes the ending of the script to “RUN;”.
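To see these substitutions in isolation, here is a minimal sketch that applies them to a hand-written stand-in for the generated code. The `vb_code` string and the `strip_vb_wrapper` helper are hypothetical and only mimic the overall shape of the output; the real m2cgen output for an XGBoost model is far longer.

```python
import re

# Hypothetical, heavily shortened stand-in for the generated Visual Basic code.
vb_code = (
    "Module Model\n"
    "Function pred(ByRef inputVector() As Double) As Double()\n"
    "    Dim var0 As Double\n"
    "    If inputVector(2) < 2.45 Then\n"
    "        var0 = 0.5\n"
    "    Else\n"
    "        var0 = -0.2\n"
    "    End If\n"
    "End Function\n"
    "End Module\n"
)


def strip_vb_wrapper(code, output_table="pred_result", input_table="dataset_name"):
    """Drop VB-only lines and wrap the remaining body in a SAS DATA step."""
    # Remove variable declarations and block terminators that SAS does not use.
    code = re.sub(r"Dim var.* As Double", "", code)
    code = re.sub(r"End If", "", code)
    # Swap the Module/Function header for the DATA/SET statements.
    code = re.sub(
        r"Module Model\nFunction pred\(ByRef inputVector\(\) As Double\) As Double\(\)\n",
        f"DATA {output_table};\nSET {input_table};",
        code,
    )
    # Swap the closing lines for RUN;.
    code = re.sub(r"End Function\nEnd Module\n", "RUN;", code)
    return code


sas_like = strip_vb_wrapper(vb_code)
print(sas_like)
```

The table names are exposed as parameters only for illustration; the tutorial hard-codes pred_result and dataset_name.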

# insert ';'
all_match_list = re.findall(r'[0-9]+\n', code)
for idx in range(len(all_match_list)):
    original_str = all_match_list[idx]
    new_str = all_match_list[idx][:-1]+';\n'
    code = code.replace(original_str, new_str)
all_match_list = re.findall(r'\)\n', code)
for idx in range(len(all_match_list)):
    original_str = all_match_list[idx]
    new_str = all_match_list[idx][:-1]+';\n'
    code = code.replace(original_str, new_str)

To follow the syntax rules in SAS, “;” is needed to indicate the end of each statement.
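As a compact alternative, the two loops can probably be collapsed into a single substitution that appends “;” to every line ending in a digit or a closing parenthesis. This is only a sketch: the `fragment` string and the `add_semicolons` helper are hypothetical, and the rule assumes that statements in the generated code always end in a digit or “)”.

```python
import re

# Hypothetical fragment in its intermediate state (VB wrapper already removed).
fragment = (
    "If inputVector(2) < 2.45 Then\n"
    "var0 = 0.43\n"
    "Else\n"
    "var0 = -0.21\n"
    "var3 = Math.Exp(var0 + var1)\n"
)


def add_semicolons(code):
    """Append ';' to every line that ends in a digit or a closing parenthesis."""
    return re.sub(r"([0-9)])\n", r"\1;\n", code)


terminated = add_semicolons(fragment)
print(terminated)
```

Lines ending in “Then” or “Else” are left alone, which is what SAS expects, because the IF ... THEN statement continues onto the following line.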


# replace the 'inputVector' with var name
dictionary = {'inputVector(0)':'sepal_length',
              'inputVector(1)':'sepal_width',
              'inputVector(2)':'petal_length',
              'inputVector(3)':'petal_width'}
for key in dictionary.keys():
    code = code.replace(key, dictionary[key])

Using a dictionary, we can map each “inputVector” element to the corresponding variable name in the input dataset and replace them all in one go.
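For datasets with many columns, writing the dictionary by hand gets tedious. A minimal sketch of the same idea, driven by an ordered list of column names, is shown below. The `rename_inputs` helper and the `sample` string are hypothetical, and the list is assumed to be in the same order as the columns used to train the model.

```python
# Column names of the input table, in training-column order (assumption).
feature_names = ["sepal_length", "sepal_width", "petal_length", "petal_width"]


def rename_inputs(code, names):
    """Replace each inputVector(i) with the i-th column name."""
    for idx, name in enumerate(names):
        # The closing parenthesis is part of the search key, so
        # 'inputVector(1)' can never match inside 'inputVector(10)'.
        code = code.replace(f"inputVector({idx})", name)
    return code


sample = "If inputVector(2) < 2.45 Then\nvar0 = inputVector(0) * 0.1;\n"
renamed = rename_inputs(sample, feature_names)
print(renamed)
```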

# change the prediction labels
code = re.sub('Math.Exp', 'Exp', code)
code = re.sub(r'pred = .*\n', '', code)
temp_var_list = re.findall(r"var[0-9]+\(\d\)", code)
for var_idx in range(len(temp_var_list)):
    # re.escape keeps the parentheses in e.g. 'var51(0)' from being read as a regex group
    code = re.sub(re.escape(temp_var_list[var_idx]), iris.target_names[var_idx]+'_prob', code)

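The last conversion step changes the prediction labels: “Math.Exp” becomes the SAS function “Exp”, the “pred = ...” return line is dropped, and each element of the output array is renamed to the matching class name with a “_prob” suffix (for example, setosa_prob).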
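The renaming is easier to follow on a small example. The sketch below runs the same three operations on a hypothetical tail of the script. The `tail` string, the array name var51, and the `rename_outputs` helper are made up for illustration, and the real generated code may look different. It uses plain string replacement for the renaming, which avoids regex escaping, and removes repeated matches first as a precaution.

```python
import re

class_names = ["setosa", "versicolor", "virginica"]

# Hypothetical tail of the partly converted script.
tail = (
    "var51(0) = Math.Exp(var0);\n"
    "var51(1) = Math.Exp(var1);\n"
    "var51(2) = Math.Exp(var2);\n"
    "pred = var51\n"
)


def rename_outputs(code, names):
    """Rename the output array elements to '<class>_prob' columns."""
    code = code.replace("Math.Exp", "Exp")   # SAS has its own Exp function
    code = re.sub(r"pred = .*\n", "", code)  # drop the VB return line
    # Collect output elements such as 'var51(0)'; dict.fromkeys removes
    # repeats while keeping first-seen order.
    outputs = list(dict.fromkeys(re.findall(r"var[0-9]+\(\d\)", code)))
    for out, name in zip(outputs, names):
        code = code.replace(out, name + "_prob")
    return code


labelled = rename_outputs(tail, class_names)
print(labelled)
```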


# save output
with open('vb1.sas', 'w') as vb:
    vb.write(code)

Lastly, we save the output with the “.sas” suffix.

That’s the end of the first task; you should now be able to convert your trained models to SAS scripts. To double-check the generated SAS script, you can use the script below to compare the Python predictions with the SAS predictions. Note that the predicted probabilities (Python vs SAS) may differ slightly, but the differences should be negligible.

# python pred
python_pred = pd.DataFrame(model.predict_proba(X_test))
python_pred.columns = ['setosa_prob','versicolor_prob','virginica_prob']
python_pred

# sas pred
sas_pred = pd.read_csv('pred_result.csv')
sas_pred = sas_pred.iloc[:,-3:]
sas_pred

(abs(python_pred - sas_pred) > 0.00001).sum()

