
Models

AVAILABILITY
This functionality is available to customers who have purchased the Data Distiller add-on. For more information, contact your Adobe representative.

Query Service now supports the core processes of building and deploying a model. You can use SQL to train the model using your data, evaluate its accuracy, and then use the trained model to make predictions on new data. You can then use the model to generalize from your past data to make informed decisions about real-world scenarios.

The three steps in the model lifecycle for generating actionable insights are:

  1. Training: The model learns patterns from the provided dataset. (create or replace model)
  2. Testing/Evaluation: The model's performance is assessed using a separate dataset. (model_evaluate)
  3. Prediction: The trained model is used to make predictions on new, unseen data. (model_predict)

Use the model SQL extension, added to the existing SQL grammar, to manage the model lifecycle according to your business needs. This document covers the SQL required to create or replace a model, train, evaluate, retrain when necessary, and predict insights.
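
For orientation, the following condensed sketch strings the three lifecycle steps together using the churn example developed later in this document. The churn_holdout and churn_new_periods dataset names are hypothetical stand-ins for datasets whose schemas match the training data; each step is covered in detail in the sections that follow.

-- 1. Train: learn patterns from historical data.
CREATE MODEL churn_model
TRANSFORM (vector_assembler(array(current_customers, previous_customers)) features)
OPTIONS(MODEL_TYPE='linear_reg', LABEL='churn_rate')
AS SELECT * FROM churn_with_rate;

-- 2. Evaluate: assess performance against a holdout dataset.
SELECT * FROM model_evaluate(churn_model, 1,
  SELECT current_customers, previous_customers, churn_rate FROM churn_holdout);

-- 3. Predict: score new, unseen data with the trained model.
SELECT * FROM model_predict(churn_model, 1,
  SELECT current_customers, previous_customers FROM churn_new_periods);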

Model creation and training create-and-train

Learn how to define, configure, and train a machine learning model using SQL commands. The SQL below demonstrates how to create a model, apply feature engineering transformations, and initiate the training process to ensure the model is configured correctly for future use. The following SQL commands detail the different options for model creation and management:

  • CREATE MODEL: Creates and trains a new model on a specified dataset. If a model with the same name already exists, this command returns an error.
  • CREATE MODEL IF NOT EXISTS: Creates and trains a new model only if a model with the same name does not already exist on the specified dataset.
  • CREATE OR REPLACE MODEL: Creates and trains a model, replacing the latest version of an existing model with the same name on the specified dataset.
{CREATE MODEL | CREATE MODEL IF NOT EXISTS | CREATE OR REPLACE MODEL}
model_alias
[TRANSFORM (select_list)]
[OPTIONS(model_option_list)]
[AS {select_query}]

model_option_list:
    MODEL_TYPE = { 'LINEAR_REG' |
                   'LOGISTIC_REG' |
                   'KMEANS' }
  [, MAX_ITER = int64_value ]
  [, LABEL = string_array ]
  [, REG_PARAM = float64_value ]

Example

CREATE MODEL churn_model
TRANSFORM (vector_assembler(array(current_customers, previous_customers)) features)
OPTIONS(MODEL_TYPE='linear_reg', LABEL='churn_rate')
AS
SELECT *
FROM churn_with_rate
ORDER BY period;

To help you understand the key components and configurations in the model creation and training process, the following notes explain the purpose and function of each element in the SQL example above.

  • <model_alias>: The model alias is a reusable name assigned to the model, which can be referenced later. A model name is required.

  • transform: The transform clause applies feature engineering transformations (for example, one-hot encoding and string indexing) to the dataset before training the model. The last clause of the TRANSFORM statement must be either a vector_assembler with the list of columns that compose the features for model training, or a type derived from the vector_assembler (such as max_abs_scaler(feature), standard_scaler(feature), and so on); a sketch of a derived transform appears after these notes. Only the columns mentioned in the last clause are used for training; all other columns, even if included in the SELECT query, are excluded.

  • label = <label-COLUMN>: The label column in the training dataset that specifies the target or outcome the model aims to predict.

  • training-dataset: This syntax selects the data used to train the model.

  • options: This keyword provides a flexible set of key-value pairs to configure the model.

    • Key model_type: model_type = '<supported algorithm>': Specifies the type of machine learning algorithm to use. Supported options include linear_reg, logistic_reg, and kmeans.
    • Key label: label = <label_COLUMN>: Defines the label column in the training dataset, which indicates the target or outcome the model aims to predict.

Use a SELECT statement in the AS clause to reference the dataset used for training.
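
As noted above, the last TRANSFORM clause can also be a type derived from vector_assembler. The following minimal sketch (the model name churn_model_scaled is hypothetical; the dataset and columns are reused from the example above) wraps the assembled features in standard_scaler:

CREATE MODEL churn_model_scaled
TRANSFORM (standard_scaler(vector_assembler(array(current_customers, previous_customers))) features)
OPTIONS(MODEL_TYPE='linear_reg', LABEL='churn_rate')
AS
SELECT *
FROM churn_with_rate
ORDER BY period;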

TIP
For a full reference on the TRANSFORM clause, including supported functions and usage across both CREATE MODEL and CREATE TABLE, see the TRANSFORM clause in the SQL Syntax documentation.

Update a model update

Learn how to update an existing machine learning model by applying new feature engineering transformations and configuring options such as the algorithm type and label column. Each update creates a new version of the model, incremented from the last version. This ensures changes are tracked, and the model can be reused in future evaluation or prediction steps.

The following example demonstrates updating a model with new transformations and options:

UPDATE MODEL <model_alias>
TRANSFORM (vector_assembler(array(current_customers, previous_customers)) features)
OPTIONS(MODEL_TYPE='logistic_reg', LABEL='churn_rate')
AS
SELECT *
FROM churn_with_rate
ORDER BY period;

Example

To help you understand the versioning process, consider the following command:

UPDATE MODEL model_vdqbrja OPTIONS(MODEL_TYPE='logistic_reg', LABEL='Survived') AS SELECT * FROM titanic_e2e_dnd;

After this command is executed, the model has a new version, as shown in the table below:

| Updated Model ID | Updated Model | New Version |
| --- | --- | --- |
| a8f6a254-8f28-42ec-8b26-94edeb4698e8 | model_vdqbrja | 2 |

The following notes explain the key components and options in the model update workflow.

  • UPDATE MODEL <model_alias>: The update command handles versioning and creates a new model version, incremented from the last version.
  • version: An optional keyword used only during updates to explicitly specify that a new version should be created. If omitted, the system automatically increments the version.
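
To confirm that an update registered a new version, you can list your models and inspect the version column using the SHOW MODELS command, which is described later in this document:

SHOW MODELS;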

Preview and persist transformed features preview-transform-output

Use the TRANSFORM clause within CREATE TABLE and CREATE TEMP TABLE statements to preview and persist the output of feature transformations before model training. This enhancement provides visibility into how transformation functions (such as encoding, tokenization, and vector assembler) are applied to your dataset.

By materializing transformed data into a standalone table, you can inspect intermediate features, validate processing logic, and ensure feature quality before creating a model. This improves transparency across the machine learning pipeline and supports more informed decision-making during model development.

Syntax syntax

Use the TRANSFORM clause within a CREATE TABLE or CREATE TEMP TABLE statement as shown below:

CREATE TABLE [IF NOT EXISTS] table_name
[WITH (tableProperties)]
TRANSFORM (transformFunctionExpression1, transformFunctionExpression2, ...)
AS SELECT * FROM source_table;

Or:

CREATE TEMP TABLE [IF NOT EXISTS] table_name
[WITH (tableProperties)]
TRANSFORM (transformFunctionExpression1, transformFunctionExpression2, ...)
AS SELECT * FROM source_table;

Example

Create a table using basic transformations:

CREATE TABLE ctas_transform_table
TRANSFORM(
  string_indexer(additional_comments) si_add_comments,
  one_hot_encoder(si_add_comments) as ohe_add_comments,
  tokenizer(comments) as token_comments
)
AS SELECT * FROM movie_review;

Create a temporary table using additional feature engineering steps:

CREATE TEMP TABLE ctas_transform_table
TRANSFORM(
  string_indexer(additional_comments) si_add_comments,
  one_hot_encoder(si_add_comments) as ohe_add_comments,
  tokenizer(comments) as token_comments,
  stop_words_remover(token_comments, array('and','very','much')) stp_token,
  ngram(stp_token, 3) ngram_token,
  tf_idf(ngram_token, 20) ngram_idf,
  count_vectorizer(stp_token, 13) cnt_vec_comments,
  tf_idf(token_comments, 10, 1) as cmts_idf
)
AS SELECT * FROM movie_review;

Then query the output:

SELECT * FROM ctas_transform_table LIMIT 1;

Important considerations considerations

While this feature enhances transparency and supports feature validation, there are important limitations to consider when using the TRANSFORM clause outside of model creation.

  • Vector outputs: If the transformation generates vector-type outputs, they are automatically converted to arrays.
  • Batch reuse limitation: Tables created with TRANSFORM only apply transformations during table creation. New batches of data inserted with INSERT INTO are not automatically transformed. To apply the same transformation logic to new data, you must recreate the table with a new CREATE TABLE AS SELECT (CTAS) statement, as shown in the sketch after the note below.
  • Model reuse limitation: Tables created using TRANSFORM cannot be directly used in CREATE MODEL statements. You must redefine the TRANSFORM logic during model creation. Transformations that produce vector-type outputs are not supported during model training. For more information, see the Feature transformation output data types documentation.
NOTE
This feature is designed for inspection and validation. It is not a substitute for reusable pipeline logic. Any transformations intended for model input must be explicitly redefined in the model creation step.
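
For example, after new batches land in movie_review, a minimal sketch of re-applying the same transformation logic (the table name ctas_transform_table_v2 is hypothetical) looks like this:

CREATE TABLE ctas_transform_table_v2
TRANSFORM(
  string_indexer(additional_comments) si_add_comments,
  one_hot_encoder(si_add_comments) as ohe_add_comments,
  tokenizer(comments) as token_comments
)
AS SELECT * FROM movie_review;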

Evaluate models evaluate-model

To ensure reliable results, use the model_evaluate keyword to assess the accuracy and effectiveness of the model before deploying it for predictions. The SQL statement below specifies a test dataset, specific columns, and the model's version to evaluate its performance.

SELECT *
FROM model_evaluate(model-alias, version-number,
     SELECT col1,
            col2,
            label-COLUMN
     FROM test_dataset);

The model_evaluate function takes model-alias and version-number as its first arguments, and a flexible SELECT statement as its final argument. Query Service first executes the SELECT statement and maps the results to the model_evaluate Adobe Defined Function (ADF). The system expects the column names and data types in the SELECT statement's result to match those used in the training step. These column names and data types are treated as test data and label data for evaluation. An example follows the note below.

IMPORTANT
When evaluating (model_evaluate) and predicting (model_predict), the transformation(s) conducted at the time of training are used.
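
For example, a minimal sketch evaluating version 1 of the churn_model created earlier (churn_test_dataset is a hypothetical holdout dataset whose column names and data types match the training data):

SELECT *
FROM model_evaluate(churn_model, 1,
     SELECT current_customers,
            previous_customers,
            churn_rate
     FROM churn_test_dataset);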

Predict predict

IMPORTANT
Enhanced column selection and aliasing for model_predict are controlled by a feature flag. By default, intermediate fields such as probability and rawPrediction are not included in the prediction output.
To enable access to these intermediate fields, run the following command before executing model_predict:
set advanced_statistics_show_hidden_fields=true;

Use the model_predict keyword to apply the specified model and version to a dataset and generate predictions. You can select all output columns, choose specific ones, or assign aliases to improve output clarity.

By default, only base columns and the final prediction are returned unless the feature flag is enabled.

SELECT * FROM model_predict(model-alias, version-number, SELECT col1, col2 FROM dataset);

Select specific output fields select-specific-output-fields

When the feature flag is enabled, you can retrieve a subset of fields from the model_predict output. Use this to retrieve intermediate results, such as prediction probabilities, raw prediction scores, and base columns from the input query.

Case 1: Return all available output fields

SELECT * FROM model_predict(modelName, 1, SELECT a, b, c FROM dataset);

Case 2: Return selected columns

SELECT a, b, c, probability, predictionCol FROM model_predict(modelName, 1, SELECT a, b, c FROM dataset);

Case 3: Return selected columns with aliases

SELECT a, b, c, probability AS p1, predictionCol AS pdc FROM model_predict(modelName, 1, SELECT a, b, c FROM dataset);

In each case, the outer SELECT controls which result fields are returned. These include base fields from the input query, along with prediction outputs such as probability, rawPrediction, and predictionCol.

Persist predictions using CREATE TABLE or INSERT INTO

You can persist predictions using either CREATE TABLE AS SELECT or INSERT INTO SELECT, including prediction outputs if desired.

Example: Create table with all prediction output fields

CREATE TABLE scored_data AS SELECT * FROM model_predict(modelName, 1, SELECT a, b, c FROM dataset);

Example: Insert selected output fields with aliases

INSERT INTO scored_data SELECT a, b, c, probability AS p1, predictionCol AS pdc FROM model_predict(modelName, 1, SELECT a, b, c FROM dataset);

This provides flexibility to select and persist only the relevant prediction output fields and base columns for downstream analysis or reporting.

Evaluate and manage your models

Use the SHOW MODELS command to list all the available models you have created. Use it to view the models that have been trained and are available for evaluation or prediction. When queried, the information is fetched from the model repository, which is updated during model creation. The details returned are: model ID, model name, version, source dataset, algorithm details, options/parameters, created/updated time, and the user who created the model.

SHOW MODELS;

The results appear in a table similar to the one seen below:

| model-id | model-name | version | source-dataset | type | options | transform | fields | created | updated | created by |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| model-84362-mdunj | SalesModel | 1.0 | sales_data_2023 | LogisticRegression | {"label": "label-field"} | one_hot_encoder(name), ohe_name, string_indexer(gender), genderSI | ["name", "gender"] | 2024-08-14 10:30 AM | 2024-08-14 11:00 AM | JohnSnow@adobe.com |

Cleanup and maintain your models

Use the DROP MODEL command to delete models you created from the model registry. Use it to remove outdated, unused, or unwanted models. This frees up resources and ensures that only relevant models are maintained. You can also include an optional model version for improved specificity; when a version is provided, only that version of the model is dropped.

DROP MODEL IF EXISTS modelName;
DROP MODEL IF EXISTS modelName modelVersion;
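
For example, to remove only version 2 of the churn model created earlier (assuming that version exists), instantiate the second form of the syntax above:

DROP MODEL IF EXISTS churn_model 2;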

Next steps

After reading this document, you now understand the base SQL syntax required to create, train, and manage trusted models using Data Distiller. Next, explore the Implement advanced statistical models document to learn about the various trusted models available and how to implement them effectively within your SQL workflows. If you haven't already, make sure to review the Feature Engineering document to ensure that your data is optimally prepared for model training.
