Models
Query Service now supports the core processes of building and deploying a model. You can use SQL to train a model on your data, evaluate its accuracy, and then use the trained model to make predictions on new data. This lets you generalize from your past data to make informed decisions about real-world scenarios.
The three steps in the model lifecycle for generating actionable insights are:
- Training: The model learns patterns from the provided dataset. (`CREATE OR REPLACE MODEL`)
- Testing/Evaluation: The model's performance is assessed using a separate dataset. (`model_evaluate`)
- Prediction: The trained model is used to make predictions on new, unseen data. (`model_predict`)
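The churn example used throughout this document maps the three lifecycle steps onto SQL roughly as follows. This is an illustrative sketch: `churn_test` and `churn_new_periods` are assumed hold-out and new-data datasets with the same columns as the training data.

```sql
-- 1. Training: learn patterns from historical data
CREATE MODEL churn_model
TRANSFORM (vector_assembler(array(current_customers, previous_customers)) features)
OPTIONS(MODEL_TYPE='linear_reg', LABEL='churn_rate')
AS SELECT * FROM churn_with_rate;

-- 2. Evaluation: assess performance on a separate dataset (churn_test is assumed)
SELECT *
FROM model_evaluate(churn_model, 1,
    SELECT current_customers, previous_customers, churn_rate FROM churn_test);

-- 3. Prediction: score new, unseen data (churn_new_periods is assumed)
SELECT *
FROM model_predict(churn_model, 1,
    SELECT current_customers, previous_customers FROM churn_new_periods);
```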
Use the model SQL extension, added to the existing SQL grammar, to manage the model lifecycle according to your business needs. This document covers the SQL required to create or replace a model, train, evaluate, retrain when necessary, and predict insights.
Model creation and training create-and-train
Learn how to define, configure, and train a machine learning model using SQL commands. The SQL below demonstrates how to create a model, apply feature engineering transformations, and initiate the training process to ensure the model is configured correctly for future use. The following SQL commands detail the different options for model creation and management:
- CREATE MODEL: Creates and trains a new model on a specified dataset. If a model with the same name already exists, this command returns an error.
- CREATE MODEL IF NOT EXISTS: Creates and trains a new model only if a model with the same name does not already exist on the specified dataset.
- CREATE OR REPLACE MODEL: Creates and trains a model, replacing the latest version of an existing model with the same name on the specified dataset.
{CREATE MODEL | CREATE MODEL IF NOT EXISTS | CREATE OR REPLACE MODEL}
model_alias
[TRANSFORM (select_list)]
[OPTIONS(model_option_list)]
[AS {select_query}]
model_option_list:
MODEL_TYPE = { 'LINEAR_REG' |
'LOGISTIC_REG' |
'KMEANS' }
[, MAX_ITER = int64_value ]
[, LABEL = string_array ]
[, REG_PARAM = float64_value ]
Example
CREATE MODEL churn_model
TRANSFORM (vector_assembler(array(current_customers, previous_customers)) features)
OPTIONS(MODEL_TYPE='linear_reg', LABEL='churn_rate')
AS
SELECT *
FROM churn_with_rate
ORDER BY period;
To help you understand the key components and configurations in the model creation and training process, the following notes explain the purpose and function of each element in the SQL example above.
- `<model_alias>`: A reusable name assigned to the model, which can be referenced later. You are required to give your model a name.
- `TRANSFORM`: Applies feature engineering transformations (for example, one-hot encoding and string indexing) to the dataset before training the model. The last clause of the `TRANSFORM` statement should be either a `vector_assembler` with a list of columns that compose the features for model training, or a derived type of the `vector_assembler` (such as `max_abs_scaler(feature)`, `standard_scaler(feature)`, and so on). Only the columns mentioned in the last clause are used for training; all other columns, even if included in the `SELECT` query, are excluded.
- `OPTIONS`: A flexible set of key-value pairs used to configure the model:
  - `MODEL_TYPE = '<supported_algorithm>'`: Specifies the type of machine learning algorithm to use. Supported options include `LINEAR_REG`, `LOGISTIC_REG`, and `KMEANS`.
  - `LABEL = '<label_column>'`: Defines the label column in the training dataset, which indicates the target or outcome the model aims to predict.
- `AS {select_query}`: The `SELECT` query that selects the data used to train the model.
For more information on the `TRANSFORM` clause, including supported functions and usage across both `CREATE MODEL` and `CREATE TABLE`, see the `TRANSFORM` clause in the SQL Syntax documentation.

Update a model update
Learn how to update an existing machine learning model by applying new feature engineering transformations and configuring options such as the algorithm type and label column. Each update creates a new version of the model, incremented from the last version. This ensures changes are tracked, and the model can be reused in future evaluation or prediction steps.
The following example demonstrates updating a model with new transformations and options:
UPDATE MODEL <model_alias>
TRANSFORM (vector_assembler(array(current_customers, previous_customers)) features)
OPTIONS(MODEL_TYPE='logistic_reg', LABEL='churn_rate')
AS
SELECT *
FROM churn_with_rate
ORDER BY period;
Example
To help you understand the versioning process, consider the following command:
UPDATE MODEL model_vdqbrja OPTIONS(MODEL_TYPE='logistic_reg', LABEL='Survived') AS SELECT * FROM titanic_e2e_dnd;
After this command is executed, a new version of the model is created, incremented from the previous version.
The following notes explain the key components and options in the model update workflow.
- `UPDATE MODEL <model_alias>`: The update command handles versioning and creates a new model version, incremented from the last version.
- `VERSION`: An optional keyword, used only during updates, to explicitly specify that a new version should be created. If omitted, the system automatically increments the version.
Preview and persist transformed features preview-transform-output
Use the `TRANSFORM` clause within `CREATE TABLE` and `CREATE TEMP TABLE` statements to preview and persist the output of feature transformations before model training. This enhancement provides visibility into how transformation functions (such as encoding, tokenization, and vector assembler) are applied to your dataset.
By materializing transformed data into a standalone table, you can inspect intermediate features, validate processing logic, and ensure feature quality before creating a model. This improves transparency across the machine learning pipeline and supports more informed decision-making during model development.
Syntax syntax
Use the `TRANSFORM` clause within a `CREATE TABLE` or `CREATE TEMP TABLE` statement as shown below:
CREATE TABLE [IF NOT EXISTS] table_name
[WITH (tableProperties)]
TRANSFORM (transformFunctionExpression1, transformFunctionExpression2, ...)
AS SELECT * FROM source_table;
Or:
CREATE TEMP TABLE [IF NOT EXISTS] table_name
[WITH (tableProperties)]
TRANSFORM (transformFunctionExpression1, transformFunctionExpression2, ...)
AS SELECT * FROM source_table;
Example
Create a table using basic transformations:
CREATE TABLE ctas_transform_table
TRANSFORM(
string_indexer(additional_comments) si_add_comments,
one_hot_encoder(si_add_comments) as ohe_add_comments,
tokenizer(comments) as token_comments
)
AS SELECT * FROM movie_review;
Create a temporary table using additional feature engineering steps:
CREATE TEMP TABLE ctas_transform_table
TRANSFORM(
string_indexer(additional_comments) si_add_comments,
one_hot_encoder(si_add_comments) as ohe_add_comments,
tokenizer(comments) as token_comments,
stop_words_remover(token_comments, array('and','very','much')) stp_token,
ngram(stp_token, 3) ngram_token,
tf_idf(ngram_token, 20) ngram_idf,
count_vectorizer(stp_token, 13) cnt_vec_comments,
tf_idf(token_comments, 10, 1) as cmts_idf
)
AS SELECT * FROM movie_review;
Then query the output:
SELECT * FROM ctas_transform_table LIMIT 1;
Important considerations considerations
While this feature enhances transparency and supports feature validation, there are important limitations to consider when using the `TRANSFORM` clause outside of model creation.
- Vector outputs: If the transformation generates vector-type outputs, they are automatically converted to arrays.
- Batch reuse limitation: Tables created with `TRANSFORM` only apply transformations during table creation. New batches of data inserted with `INSERT INTO` are not automatically transformed. To apply the same transformation logic to new data, you must recreate the table using a new `CREATE TABLE AS SELECT` (CTAS) statement.
- Model reuse limitation: Tables created using `TRANSFORM` cannot be directly used in `CREATE MODEL` statements. You must redefine the `TRANSFORM` logic during model creation. Transformations that produce vector-type outputs are not supported during model training. For more information, see the Feature transformation output data types documentation.
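Given the batch reuse limitation, a common pattern is to re-run the CTAS whenever new data lands in the source table. The following is an illustrative sketch that reuses the earlier `movie_review` example:

```sql
-- New rows inserted into movie_review are not transformed automatically,
-- so drop and recreate the derived table to reapply the TRANSFORM logic
DROP TABLE IF EXISTS ctas_transform_table;

CREATE TABLE ctas_transform_table
TRANSFORM(
    string_indexer(additional_comments) si_add_comments,
    one_hot_encoder(si_add_comments) as ohe_add_comments,
    tokenizer(comments) as token_comments
)
AS SELECT * FROM movie_review;
```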
Evaluate models evaluate-model
To ensure reliable results, assess the accuracy and effectiveness of the model before deploying it for predictions with the model_evaluate
keyword. The SQL statement below specifies a test dataset, specific columns, and the model's version to test the model by evaluating its performance.
SELECT *
FROM model_evaluate(model-alias, version-number,
    SELECT col1,
           col2,
           label-COLUMN
    FROM test_dataset);
The `model_evaluate` function takes `model-alias` as its first argument and a flexible `SELECT` statement as its second argument. Query Service first executes the `SELECT` statement and maps the results to the `model_evaluate` 51黑料不打烊 Defined Function (ADF). The system expects the column names and data types in the `SELECT` statement's result to match those used in the training step. These column names and data types are treated as test data and label data for evaluation.
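For example, to evaluate version 1 of the `churn_model` created earlier, you could write the following. This is a sketch: `churn_test` is an assumed held-out dataset with the same column names and types as the training data.

```sql
-- Evaluate churn_model version 1 against a held-out dataset (churn_test is assumed)
SELECT *
FROM model_evaluate(churn_model, 1,
    SELECT current_customers,
           previous_customers,
           churn_rate
    FROM churn_test);
```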
When evaluating (`model_evaluate`) and predicting (`model_predict`), the transformation(s) conducted at the time of training are used.

Predict predict
Intermediate fields in the output of `model_predict` are controlled by a feature flag. By default, intermediate fields such as `probability` and `rawPrediction` are not included in the prediction output. To enable access to these intermediate fields, run the following command before executing `model_predict`:

set advanced_statistics_show_hidden_fields=true;
Use the `model_predict` keyword to apply the specified model and version to a dataset and generate predictions. You can select all output columns, choose specific ones, or assign aliases to improve output clarity.
By default, only base columns and the final prediction are returned unless the feature flag is enabled.
SELECT * FROM model_predict(model-alias, version-number, SELECT col1, col2 FROM dataset);
Select specific output fields select-specific-output-fields
When the feature flag is enabled, you can retrieve a subset of fields from the `model_predict` output. Use this to retrieve intermediate results, such as prediction probabilities, raw prediction scores, and base columns from the input query.
Case 1: Return all available output fields
SELECT * FROM model_predict(modelName, 1, SELECT a, b, c FROM dataset);
Case 2: Return selected columns
SELECT a, b, c, probability, predictionCol FROM model_predict(modelName, 1, SELECT a, b, c FROM dataset);
Case 3: Return selected columns with aliases
SELECT a, b, c, probability AS p1, predictionCol AS pdc FROM model_predict(modelName, 1, SELECT a, b, c FROM dataset);
In each case, the outer `SELECT` controls which result fields are returned. These include base fields from the input query, along with prediction outputs such as `probability`, `rawPrediction`, and `predictionCol`.
Persist predictions using CREATE TABLE or INSERT INTO
You can persist predictions using either `CREATE TABLE AS SELECT` or `INSERT INTO SELECT`, including prediction outputs if desired.
Example: Create table with all prediction output fields
CREATE TABLE scored_data AS SELECT * FROM model_predict(modelName, 1, SELECT a, b, c FROM dataset);
Example: Insert selected output fields with aliases
INSERT INTO scored_data SELECT a, b, c, probability AS p1, predictionCol AS pdc FROM model_predict(modelName, 1, SELECT a, b, c FROM dataset);
This provides flexibility to select and persist only the relevant prediction output fields and base columns for downstream analysis or reporting.
Evaluate and manage your models
Use the `SHOW MODELS` command to list all the available models you have created. Use it to view the models that have been trained and are available for evaluation or prediction. When queried, the information is fetched from the model repository, which is updated during model creation. The details returned are: model ID, model name, version, source dataset, algorithm details, options/parameters, created/updated time, and the user who created the model.
SHOW MODELS;
The results appear in a table similar to the one below:

| Model ID | Model Name | Source Dataset | Algorithm | Options | Transforms | Created By |
| --- | --- | --- | --- | --- | --- | --- |
| model-84362-mdunj | SalesModel | sales_data_2023 | LogisticRegression | `{"label": "label-field"}` | `one_hot_encoder(name)` ohe_name, `string_indexer(gender)` genderSI | JohnSnow@adobe.com |
Cleanup and maintain your models
Use the `DROP MODEL` command to delete models you created from the model registry. You can use it to remove outdated, unused, or unwanted models. This frees up resources and ensures that only relevant models are maintained. You can also include an optional model version for improved specificity; this drops only the model with the provided version.
DROP MODEL IF EXISTS modelName;
DROP MODEL IF EXISTS modelName modelVersion;
Next steps
After reading this document, you now understand the base SQL syntax required to create, train, and manage trusted models using Data Distiller. Next, explore the Implement advanced statistical models document to learn about the various trusted models available and how to implement them effectively within your SQL workflows. If you haven't already, make sure to review the Feature Engineering document to ensure that your data is optimally prepared for model training.