Improving Data Categorization in Marketo Engage Using Fine-Tuned AI Models
As a Revenue Ops professional, you may be struggling with SPAM form submissions, keyword matching in job titles to determine personas, or messy open-text fields that make it hard to extract insights from your data. These data categorization challenges hinder segmentation, personalization, and reporting, preventing your team from leveraging your data and making it difficult to send tailored content to your audience.
Explore how fine-tuned Large Language Models (LLMs) can help address these persistent data problems. Learn how custom-trained models can significantly boost the accuracy of SPAM filtering, automate persona classification, and intelligently categorize unstructured inputs, and be confident about bringing AI into Marketo Engage.
You will learn about:
- Real-world use cases where AI meaningfully improves data categorization in Marketo Engage.
- How to fine-tune an LLM using your own data (featuring OpenAI as an example).
- Using the Fine-Tuned model in Marketo Engage via Webhooks.
Hi, everyone. Welcome to today's presentation on improving data categorization in Marketo Engage using fine-tuned AI models. My name is Tyran Pretorius, and today I'm going to be walking you through three use cases for fine-tuned models in Marketo Engage: how you can detect spam form fills, how you can match job titles to personas, and how you can categorize open text fields. I'll show you why you should start using these three use cases in your instance, and then I'll also show you how to set them up: using webhooks to make the OpenAI requests, how you can prepare your training data set for the OpenAI fine-tuning interface, how you can create the fine-tuned model, and then finally I'll speak to some of the limitations of webhooks and why you might want to use self-service flow steps to overcome them.
So as you can probably tell, there's a bit of a discrepancy between my name and my accent. I was born in South Africa, raised in Ireland, and I now live in San Diego, so I've bounced around quite a bit. I love volleyball, dislike surfing. And before I joined the marketing operations world, I was a mechanical engineer. I love problem solving and data analysis, so now I'm just doing that in the business domain instead of the engineering domain.
And I'm also a blogger in my free time at The Workflow Pro, where I talk about all the AI and Marketo projects that I'm working on. And there's a picture of me presenting at Summit with my friend Lucas Mercado.
And now without further ado, let's dive into the first use case of how we can detect spam form submissions using OpenAI. So if any of you have used the in-built CAPTCHA integration with Marketo, you might have noticed that sometimes a clear spam form fill like this gets a very high CAPTCHA score, and it gets the trusted label.
And on the other hand, you might have genuine form submissions like this, where you can clearly see that someone is interested in SIM cards for vehicle tracking and they've got a legit-looking email address, website, and phone number, but they're still being marked as suspicious by CAPTCHA and given a very low CAPTCHA score.
Where this causes issues downstream is if you use the "CAPTCHA normalized score is not suspicious" constraint on your triggers. What this results in is that the genuine person who is looking for SIM cards won't make it through to the sales team because of their low CAPTCHA score. And on the other hand, that clear spam form submission with all the random characters in all the fields gets classified as genuine, so it will make it through this trigger and into the rest of the flow of your smart campaign. This wastes salespeople's time on all these spam form submissions, and then you've got genuine leads who aren't making it to sales, which can lead to lost revenue. So that's why it's important that we start using OpenAI to help us detect these spam form submissions. And the thing I like the most is that with CAPTCHA, it's a bit of a black box. You don't really know why Google classified a lead as trusted or suspicious. Whereas when you define your own fine-tuned model, you specify the rules that determine whether someone is going to be a spam form submission or genuine.
And you can also prompt it to give you an explanation of why a lead was marked as genuine or a bot.
And I'll show you how to set up the webhook later. But the main idea is that once your smart campaign is triggered, we'll call the form fill categorization webhook and map the output of this webhook. You'll see there's this field here: if this spam categorization field, which we map to the output of the webhook, contains bot, then we're going to stop the person from progressing any further down the rest of the flow. So that's how we can screen out all these spam form submissions from our Marketo smart campaigns once we've used the OpenAI spam categorization fine-tuned model that we've created.

The next use case I'm going to talk about is persona matching based on job title. So if any of you have tried to do persona matching based on job title, where you're trying to match on certain keywords in the title, you've likely run into inadvertent matches. Let's say you're trying to match on the chief operating officer persona and you do job title contains COO; that can have inadvertent matches for things like cook and coordinator, which you obviously don't want associated with the chief operating officer persona.
And then what if someone enters a job title that's not in English, if it's in Spanish or French? Or what if they misspell their job title and say chef operating officer instead of chief operating officer? These are all issues that exist with the current keyword matching on job title to try and get the corresponding personas.
But these can all be solved by using AI. And the powerful thing is that AI is smart enough to know what job titles and job title acronyms correspond to a chief operating officer. So if someone misspells it and says chef operating officer, it's smart enough to know that that still maps to the chief operating officer persona.
And it can also handle any language. So that one there, director de operaciones, I think that's Spanish for chief operating officer, and it's still smart enough to know that it maps to your same COO persona.
And the way our smart campaign flow would work is that when anyone is created with a job title populated, or their job title changes, we're going to call the persona categorization webhook to make that request to OpenAI. And once we get that persona value back, the persona field, which you can see in the flow here, is mapped to the output of the webhook. So when we get our persona back, we can maybe do some lead scoring based off of that and give people more points if they're in the C-suite versus if they're an engineer.
And the powerful thing about building our fine-tuned models on top of the broader base model that OpenAI offers, let's say GPT-4o, is that even though our training data set might not contain misspellings or different languages, the model is smart enough to extrapolate beyond our training data set. So if there are different languages or misspellings, it's still smart enough to do the correct persona matching, because we're building our fine-tuned model and that fine-tuned data set on top of the broader base model.
Now the third and final use case is open text field categorization. So at my company Telnyx, we have this "How did you hear about us?" field, and it's an open text field. So imagine for a second that it wasn't an open text field and was just a drop-down picklist. What you're really getting there is reinforcement for attribution sources that you already know about: paid search, organic, social media, and maybe LLM bots now as another option. But they're just reinforcing all the sources you already know about; you're not really gaining new insights. Whereas if you leave this as an open text field, you'll be able to see new insights like YouTube influencers who might be referencing your products, apps that are listing you as an integration partner, or blogs linking to your website. You wouldn't be able to find out about all these unique new sources if you had a drop-down. So this is the power of having an open text field for something like this.
But then the challenge downstream is that it makes it difficult to segment, personalize, and report on your data, and it makes it difficult if you want to send tailored content to people based on all these attribution values. And you'll notice here how stratified each of the bars is in the bar chart. That's because of all the infinite values that people can type in this open text field. To give a very concrete example, people will misspell Google; they'll type things like Goggle or Gogle. And although these should all match to the exact same value, because we know they all should go to Google, they all appear as distinct lines on the bar chart here. So that's just an example of how it can be difficult to analyze this open text field data. And obviously, if you're getting fewer than 10 form fills a day, it might be possible for you to manually review these fields and see insights and patterns. But if you've got hundreds of form fills a day, you're obviously not going to have the bandwidth to manually go through all these open text field values to do the categorization. And that's where AI comes in: it can do this categorization for us.
And as you can see here, I've just used the example where we're categorizing into a few different buckets like organic, referral, and unknown. And as I mentioned before, AI is very powerful when it comes to fuzzy matching, so it can handle misspellings; it will know to put Google and Goggle in the organic bucket. And it can also handle different languages. So, "Je vous ai trouvé en cherchant en ligne." I took French for five years in high school, so hopefully I'm not too rusty there, but that basically means "I found you while searching online," so that should also go in the organic bucket, and it's smart enough to put it there. And here, just to keep it simple, I've only got it bucketing into these higher-level buckets like organic, referral, and unknown. But you can also prompt it to give you a lead source detail if you want. So the lead source could be organic, and then you could prompt it to give you a lead source detail of Google.
Okay, so now that I've shown you the three use cases, and hopefully inspired you to start thinking of ways that you could categorize your own data within Marketo, I want to take a step back and show you how we need to set up the webhook in Marketo so that we can leverage OpenAI to do this categorization for us. So this is what the webhook looks like here. You'll see that in the URL field, we're specifying the OpenAI URL. And there's a Champion blog post, which I'll show you in the resources at the end of this presentation, where you can access this URL. And you'll also need it later on to access things like your OpenAI API key.
So don't worry about that. There's a blog post that shares all these resources with you.
And I'm going to go to the next slide because it's easier to look at the payload of the webhook when it's zoomed in like this. So the model value here, this is going to be the fine-tuned model ID, which we'll get after we create the model in OpenAI. The temperature ranges from zero to two. If you put zero in, then the output will be very robotic and deterministic, so if you put the same input in, you're very likely to get the same output time after time. But if you use a value of two, then it's going to be very creative and random, so if you put the same input in, you're going to get different outputs every time you run it.
I'll speak to the max completion tokens on the next slide and what that enables us to do. But I'll skip forward to the messages parameter here. So within messages, we specify the system prompt, and here we're just saying your job is to categorize a given job title into one of the following personas: C-suite, engineer, manager, and other. And then we also give it the user value, and here that's just going to be the job title, which we're bringing in using the lead token.
And OpenAI gives you something called the tokenizer. Whenever you make a prompt to OpenAI or any large language model, it transforms the characters and the words you send it into tokens. So we can see here that C-suite uses up two tokens. And I put in the three other personas we had: engineer, other, and manager. All four of these personas consume two tokens at most. So that's why I've constrained the max tokens parameter here to be two, because sometimes large language models can be a bit verbose and give you back more than you want. In order to prevent that, and to ensure it only gives me one of those four personas, I'm constraining the maximum output tokens to be two.
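If you want to sanity-check those token counts yourself rather than eyeballing the web tokenizer, a quick sketch in Python (assuming the tiktoken package, and a recent enough version that it knows the GPT-4o encoding) would be:

```python
import tiktoken

# Load the tokenizer used by GPT-4o-class models (assumption: your fine-tune is based on gpt-4o)
enc = tiktoken.encoding_for_model("gpt-4o")

for persona in ["C-suite", "Engineer", "Manager", "Other"]:
    tokens = enc.encode(persona)
    print(f"{persona!r} -> {len(tokens)} token(s)")

# If every label encodes to at most two tokens, capping the output at two tokens is safe.
```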
And then the last part of our webhook configuration is we need to set our API key here. So in that authorization header, where you see Bearer and then the three Xs, you're going to replace those three Xs with your OpenAI API key. And as I mentioned before, the link to get your API key from OpenAI will be in that Champion blog post.
And then at the bottom, you'll notice that we're mapping the Marketo persona field to a response attribute. And that response attribute, choices[0].message.content, looks a little bit complicated, but the only reason we need to do this is because of the structure of the data we receive back from OpenAI. You'll notice that it gives us a choices array; we want the first index of that array, so index zero, then we want the message parameter, and then within the message parameter we want the content value. That's what actually contains the persona we're looking for. So this notation here is just allowing us to pull the value we want out of the OpenAI response.
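To make the payload and that response mapping concrete, here is a rough Python equivalent of what the webhook does; the model ID and the example job title are placeholders, and the endpoint is OpenAI's standard chat completions URL:

```python
import os
import requests

API_KEY = os.environ["OPENAI_API_KEY"]        # your OpenAI API key
MODEL_ID = "ft:gpt-4o:your-org:persona:xxxx"  # placeholder: the fine-tuned model ID from OpenAI

payload = {
    "model": MODEL_ID,
    "temperature": 0,            # deterministic, repeatable output
    "max_completion_tokens": 2,  # every persona label fits in two tokens
    "messages": [
        {"role": "system",
         "content": "Your job is to categorize a given job title into one of the "
                    "following personas: C-suite, Engineer, Manager, Other."},
        # In the Marketo webhook this is the job title lead token
        {"role": "user", "content": "Chef Operating Officer"},
    ],
}

response = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=30,
)

# Same path as the Marketo response attribute: choices[0].message.content
persona = response.json()["choices"][0]["message"]["content"]
print(persona)  # e.g. "C-suite"
```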
So now we know how to set up our webhook. We're going to move on to creating the training data set that we'll need to create the fine-tuned model in OpenAI.
So the way we're going to do this is we're going to create a smart list. For the use case where we want to categorize spam form fills, we're going to extract 30 days' worth of contact sales data to train our model. So here I'm including all the form fields that are present in that form. I'll speak to PII and data privacy concerns with LLMs a bit later on. But for now, know that if you're concerned about PII and sharing that with large language models, then you can remove the full name and email fields and only send the company, phone, website, and additional information fields, because those alone should be enough for the large language model to detect whether the form fill was a spam submission or not.
So we're going to export this data from Marketo. We're going to download it as a CSV.
And then we're going to import it into a Google Sheet. One thing to note is that the file we're going to upload to OpenAI later on needs to be a JSON Lines file, or JSONL file for short.
And they say you need at least 10 examples in order to create a fine-tuned model, and 50 are recommended. And if you've got near 100 examples, I'd recommend splitting it 80-20, so you'll have 80% of your examples in a training data set and the other 20% in a validation data set.
And with the structure of your sheet here, the user value, in this case, is going to be all the job titles that you brought in from Marketo. And then the assistant value is going to be the desired output that you'd like the job title to be matched to.
And it's important here, when you're defining the assistant column, that you'll manually have to go in and say for each job title, this is what I want the persona to be.
And you should be consistent here. Let's say, for example, you've got a software manager. If you map that to the manager persona in one row, but then later on you see software manager again and map it to the engineering persona, that's going to confuse the AI model because it's inconsistent: in one place you're saying it should map to the manager persona, and in the other you're saying it should map to the engineering persona. So be as consistent as possible here when you're going through and manually setting the assistant value for each row.
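One simple way to catch that kind of inconsistency before you upload anything is to scan the labeled rows for any user value that has been given more than one assistant label. A minimal sketch in Python, assuming you have exported the user and assistant columns to a CSV called labels.csv:

```python
import csv
from collections import defaultdict

# Collect every assistant label assigned to each user value (e.g. each job title)
labels = defaultdict(set)
with open("labels.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):          # expects "user" and "assistant" columns
        labels[row["user"].strip().lower()].add(row["assistant"].strip())

# Flag any input that was labeled inconsistently across rows
for user_value, assigned in labels.items():
    if len(assigned) > 1:
        print(f"Inconsistent labels for '{user_value}': {sorted(assigned)}")
```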
And then the system prompt, this is going to be the exact same across all the rows in your spreadsheet. So you can kind of just fill it out once at the top and then drag it down.
And then we'll use a Google Sheets formula, which I'll show you in a second when I hop out of the presentation, to join the system prompt, the user value (which in this case is the job title), and the assistant value (which in this case is the desired persona). We're going to join those all together with a formula in the JSON field here to form a JSON object. And then all these lines in the Google Sheet, each one a JSON object, are what will form the JSONL file that we upload to OpenAI later on.
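For reference, each line of the finished JSONL file is one chat-format training example. For the persona use case, a single line would look roughly like the output of this snippet (built with Python's json module, which also handles the escaping for you):

```python
import json

example = {
    "messages": [
        {"role": "system",
         "content": "Your job is to categorize a given job title into one of the "
                    "following personas: C-suite, Engineer, Manager, Other."},
        {"role": "user", "content": "Senior Software Engineer"},
        {"role": "assistant", "content": "Engineer"},
    ]
}

# json.dumps produces one valid JSON object per line, which is exactly the JSONL format
print(json.dumps(example))
```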
So I'm going to jump out of the presentation now to show you what this looks like in Google Sheets.
So once we're in Google Sheets here, you'll notice that we've got a lot of gray columns. These gray columns are essentially helper columns that are going to help us form the desired JSON object that we need to upload to OpenAI. This particular sheet is for mapping the "How did you hear about us?" values to the desired source values, but it looks exactly the same whether we're doing the job title mapping or the spam form fill classification; the layout of the sheet is the same every time. So I'll use the job title one as my example, since it's the simplest. We've got our system prompt here, which as I mentioned before you can just drag down for all the rows of your sheet, and we've got all our job titles here. Then we have to manually come in and type out, in the assistant column, what persona we want each of these job titles to match to. And then these helper columns A through D here are concatenated using this concat function, along with the system, user, and assistant values, to form the required JSON object that OpenAI needs.
And you'll notice here that I'm using something called the 2JsonSafe function. This is a custom function which I've created, because sometimes there can be illegal characters present.
Particularly, let's say for example in this form's additional information field, people can put new line characters in here, they can put double quotes, and those kinds of illegal characters. Once we concatenate them together and put them here in the JSON object, they could break the JSON syntax, and the file would later on be rejected by OpenAI.
So this is a function that basically just replaces any of these illegal characters, like backslashes, double quotes, new lines, carriage returns, and tabs, so they won't break JSON syntax. And we call it here when we're doing the concatenation: we're calling it on the user value, we're calling it on the assistant value, and we're also calling it on the system prompt.
So that will ensure that this JSON object we create is valid.
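The Apps Script itself is linked in the Champion blog post, but the escaping it performs is standard JSON escaping. A rough Python equivalent of the same idea looks like this:

```python
def to_json_safe(value: str) -> str:
    """Escape characters that would otherwise break JSON syntax inside a string."""
    return (
        value.replace("\\", "\\\\")   # backslashes first, so later escapes aren't doubled
             .replace('"', '\\"')     # double quotes
             .replace("\n", "\\n")    # new lines
             .replace("\r", "\\r")    # carriage returns
             .replace("\t", "\\t")    # tabs
    )

print(to_json_safe('Line one\nLine "two"'))  # -> Line one\nLine \"two\"
```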
And the next step, before we download our JSON Lines file, is to just copy all of these lines.
And then we're going to paste them into the sheet here, and then click this big validate button. And you can see it says the input is valid JSON Lines format. So this is exactly what we're looking for. However, if there was some sort of issue in here, let's say for example, I just delete something here.
It's now telling me invalid JSON on line nine.
So that means I have to go to line nine, which in this case is going to be this one here. And then I know, okay, it's line nine, which corresponds to row 10 in my spreadsheet, so I need to look at this particular row to find the error. And to help us with that, we can copy it in here.
So obviously, this one is still correct, because I didn't make the deletion I made in this browser in the actual Google Sheet, so this is still accurate. But if I, let's say, delete a character here, the nice thing is it flags for you where the error is occurring, and it'll give you an error message down here.
And if you don't know enough about JSON syntax to fix it, that's where large language models are a godsend: you can copy and paste this JSON body, and you can paste the error message into ChatGPT, or you can use Claude or any large language model you want, like Grok offered by X. Paste it in there and it will help you fix the JSON object. You could even prompt it to just give you the fixed JSON object, and then you copy that.
And then that's what you'd use here in column N.
And then the final part of this is, once you've validated that all of these JSON objects are in the correct format, you can copy them.
And then here, we're going to download these values as a JSON Lines file. We just click Download here, and this is going to be the JSON Lines file that we're going to import into OpenAI in the next step. So I'm going to jump back into the presentation now and show you exactly how to do that.
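If you'd rather validate locally instead of pasting into an online validator, a few lines of Python do the same job of pointing you at the broken line (the file name here is a placeholder):

```python
import json

# Check every line of the JSONL file and report any that fail to parse
with open("training_data.jsonl", encoding="utf-8") as f:
    for line_number, line in enumerate(f, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            json.loads(line)
        except json.JSONDecodeError as err:
            print(f"Invalid JSON on line {line_number}: {err}")
```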
OK, so I just showed you option one, which is using a Google Sheet to manually create the JSONL file you need. I will also say that in the Champion blog post, which I share in the resources, I share Python code that makes it much easier to create the desired JSONL file format. All you need for the Python code is to give it the user column and the desired assistant values. So in this case, you can see here we're going to give it all the job titles in the user column and then all the corresponding personas in the assistant column. That's all it needs.
And then you just need to modify the system prompt if you want to, and also modify the locations of all these files. You can see here it's referencing my local downloads folder; you'd obviously want to change that to your own folder, and you can rename some of these files here. But once you've done that, it's going to create the JSONL files for you in the locations that you specify in the train output file and the validate output file. And it's also nice because if you've got more than 100 examples, it'll automatically split your data set, 80% into the training output file and 20% into the validate output file. So if you're competent with running code, I recommend that as the method to create your JSON Lines files. And even if you're not, it's becoming easier and easier nowadays to use an LLM to guide you through running and executing code, so I'd explore this option if you're comfortable using LLMs for programming.
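The blog post has the full script, but to give a feel for what it does, here is a minimal sketch under the same assumptions: a CSV export with user and assistant columns, a fixed system prompt, and an 80/20 split. All the file paths and the prompt are placeholders you would change:

```python
import csv
import json
import random

SYSTEM_PROMPT = ("Your job is to categorize a given job title into one of the "
                 "following personas: C-suite, Engineer, Manager, Other.")
INPUT_CSV = "job_titles.csv"   # placeholder: export with "user" and "assistant" columns
TRAIN_OUTPUT = "train.jsonl"   # placeholder output paths
VALIDATE_OUTPUT = "validate.jsonl"

# Build one chat-format training example per row
examples = []
with open(INPUT_CSV, newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        examples.append({
            "messages": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": row["user"]},
                {"role": "assistant", "content": row["assistant"]},
            ]
        })

# Shuffle, then split 80/20 into training and validation sets
random.shuffle(examples)
split = int(len(examples) * 0.8)

for path, subset in [(TRAIN_OUTPUT, examples[:split]), (VALIDATE_OUTPUT, examples[split:])]:
    with open(path, "w", encoding="utf-8") as out:
        for example in subset:
            out.write(json.dumps(example) + "\n")
```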
OK, so now we're getting to, I don't know if you could classify this as the more exciting part, but maybe the most crucial part of the whole presentation, which is actually creating the fine-tuned model in OpenAI.
And before we talk about creating the model and I show you how to do that, I wanted to speak to OpenAI's data retention policies and security practices. When you share data with OpenAI using the API, it does not use your data to improve its models. And in Marketo, when we're making the webhook request to OpenAI, we're using its API, so in this case, whatever data we send it over the webhook is fine; it won't be used to improve its models.
And when you're using ChatGPT in the browser, it may use the data for model improvement, but there's an option to deactivate this in the settings. And as I'll show you on the next slide, if you have a paid company account, there'll always be a little blurb at the bottom of the chat interface saying that it doesn't use your company's data for training its models.
Whether you use the API or the browser, OpenAI will always retain your data for 30 days just to make sure you're not violating any of their policies.
But this is my high-level overview. If you have any concerns, I'd recommend talking to the compliance officials in your company and seeking legal advice. And another best-practice step is to anonymize the data wherever possible. Like I showed you earlier on, you can remove the full name and email fields so the data isn't associated with a person.
And this is what I was referring to before. If you have a paid company account, you'll always see this blurb at the bottom, which says OpenAI doesn't use Telnyx's official workspace data to train its models.
OK, so now we've gotten all the legal stuff out of the way, I'll show you what the interface looks like when you're creating a fine-tuned model. So the base model is the one you choose; in this case, I've just chosen GPT-4o. The suffix you can set to whatever you want it to be, but make sure it's something that easily corresponds to what the fine-tuned model does. So in this case, it's persona matching, and then I just put the date I'm creating the fine-tuned model. You can leave the seed blank; its job is basically just to help increase the reproducibility of the outputs of the fine-tuned model.
But if you leave this blank, OpenAI handles it for you, so I'd recommend just leaving it blank. You can upload your training data file and then your validation data file. As I mentioned, you should have an 80-20 split between those, but I'd only worry about splitting it this way if you've got about 100 examples.
And then for the hyperparameters, unless you're very familiar with how LLMs work in the back end and what each of these does, I'd recommend just using the auto configuration here.
And then hit Create. And then go off and do something else because it usually takes a while for the fine-tuned model to generate. But they will email you once the model has been completed.
And then you can come in, and you'll see the output model ID. This is the value we're going to copy and use in our Marketo webhooks like I showed you earlier on. And then you'll also see the hyperparameters that were used here, so if you want to improve performance later on, you can try tweaking these values.
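If you prefer, the same job can be kicked off programmatically rather than through the UI. A minimal sketch with the official OpenAI Python SDK, assuming the JSONL files from earlier (the base model, suffix, and file names are placeholders):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the training and validation files created earlier
train_file = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
validate_file = client.files.create(file=open("validate.jsonl", "rb"), purpose="fine-tune")

# Create the fine-tuning job on top of a GPT-4o base model
job = client.fine_tuning.jobs.create(
    model="gpt-4o-2024-08-06",   # placeholder: whichever base model you have access to
    training_file=train_file.id,
    validation_file=validate_file.id,
    suffix="persona-matching",   # ends up embedded in the fine-tuned model ID
)

# Once the job finishes, this returns the model ID to paste into the Marketo webhook
print(client.fine_tuning.jobs.retrieve(job.id).fine_tuned_model)
```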
And when you're trying to decide if performance is better from one model compared to another, what you're going to look at is this metrics section here. And you want these values to be as low as possible.
You want them to be very low decimals. Something like 0.0029 is quite good, but I've seen better.
And if you're very curious, you can ask OpenAI what each of these values means and how to improve each one, and it'll give you recommendations for that. But the main guiding principle here is that the more examples you have, the better, and the more consistent you are when you're defining the assistant column, like I showed you earlier on, the better the performance.
So bear those things in mind when you're trying to improve performance.
I mentioned webhooks as the main method by which we're going to make these requests to OpenAI, but there are two main limitations to this. The first is that OpenAI can sometimes take a while to return the results to Marketo, and in that case we can run into a 30-second timeout limit, because if it takes that long, Marketo thinks the webhook has failed, and then you'll never get the categorized data that you were looking for.
And the second issue is that in Marketo, we can't dig any deeper than the content parameter that's returned from OpenAI. So even if OpenAI gives us a hear source and a hear source detail within the content parameter, unfortunately we can't dig any deeper in Marketo. This would have been nice for the "How did you hear about us?" categorization, where we could have a hear source value like organic and then a hear source detail value like Google.
It'd be really nice to get that extra level of detail, but unfortunately, with webhooks, it's not possible to dig in and get this. So the solution I recommend if you're running into either of these two issues, the 30-second timeout or wanting to map multiple values from OpenAI to multiple Marketo fields, is to use self-service flow steps to get around them.
Okay, so what do I want you to take away from today's presentation? Hopefully you've been inspired by the three use cases I showed you today, and they've started you thinking about different challenges you're having with data categorization at your organization. So I'd like you to pick one of them, and then export the relevant data from Marketo Engage using a smart list like I showed you before. Then I want you to prepare your data for fine-tuning using either the Google Sheet approach, like I demonstrated, or the Python script that's shared in the Champion blog. Then you're going to go to OpenAI and create your fine-tuned model, and you're going to get that output model ID; that's what you're going to use in your Marketo Engage webhook or your self-service flow step. And then finally, you're going to call this webhook in your smart campaign to categorize your data and finally start reaping the benefits of AI for data categorization. So that's everything I want you to take away from today's presentation. Thank you for your attention today. I hope this was beneficial for you, and I'm now happy to answer any questions that you might have.
Thanks so much, Tyran. So much to take away from that discussion. But before we move on to our next session, I want to give you the opportunity to answer some questions from the audience. So I'm going to ask you a few questions. And if you're in the chat, go ahead and submit some now, because Tyran's going to answer them live. And then Tyran, I'll read them to you, and you can answer them for the audience. Sure, sounds good.
Great. What sort of maintenance do you have to do to maintain these models that you created? Not a lot of maintenance. Once you set it up, it continually runs in the background. But if you ever do want to retrain the model, if you see some examples coming through and you're not happy with the output, you can take those examples, use them to retrain the model, and then you'll get a new model ID for the fine-tuned model and you can swap that in. So that would really be the only time you'd need to update the webhook. Or if OpenAI just released a new model, like the GPT-5 model last week, and you want to update to that, that would be another example. But for the most part, it's pretty maintenance-free.
Great, great, great. And then what are your recommendations if you're trying to pull non-native-language or non-English information into the model? And how does that language get handled if it's coming in Japanese and you want to output it in English, or anything like that? Yeah, I saw the question about the Japanese job titles. And I'd say that's the power of AI: it's not going to convert from Japanese to English and then try and do the persona matching. It's smart enough to understand what the Japanese job title means and map it to the correct persona. So that's the power of using AI for this job title matching; it's smart enough to understand all the cultural context or whatever there is behind a job title and map it to the correct persona that you have.
Cool. Very great.
So we had a question for this, similar to one for Josh, about using third-party data. How would you incorporate something like a RingLead for data categorization or fuzzy matching? And how do you use that with OpenAI to create superior data quality improvements, anything like that? The nice thing about using OpenAI is you've got a lot more control, I'd say, than if you're using RingLead, because with the fine-tuned model that I showed, you can train it on examples to say, this is exactly what we saw in our Marketo instance, and this is what we want the response to be. And then you can obviously use that fine-tuned model in your Marketo instance, so you're mapping job titles to personas in the exact same way that you want, and you have full control over how that mapping is done. So I'd say maybe that's the advantage of using OpenAI: you've got full control over the prompt you give the AI and all the training examples that you give it.
Great. Great, great, great. All right, on to the next question. Talking about more of that persona matching we were just discussing, can you have it use a combination of job title, job role, and department? Absolutely, yep. In that Google Sheet example I showed you, where you're bringing in all the values you exported from Marketo in one column, you just need to bring in those two extra fields of job role and department, and then, when you're modifying the assistant column, also specify the desired output for all those job title, job role, and department combinations. So you just have to bring in the extra fields, basically, and then it will work right off the bat, same as before.
Awesome. Great, great, great. The next one then is, I'm assuming that you've created some sort of step-by-step guide people can follow to build this on their own, knowing you. Where can they get that information from you? Yep, so for that 2JsonSafe function, there's a Marketo Champion blog that should have just been released today on how to use fine-tuned models with Marketo Engage, and in that blog post I linked to GitHub, where I share the Google script you need for that 2JsonSafe function, so you can just copy it straight from there and put it into the Apps Script extension in Google Sheets.
Awesome. Great, great, great. And then, speaking of JSON, how do you validate that JSON format when you send it through the webhook for the form submissions when they come through? When you're configuring your webhook in Marketo, there's an option called Request Token Encoding, and in that drop-down you can select JSON. Then, if there are any illegal characters like I was talking about before, it will convert them to a JSON-safe version, so your webhook will always send successfully to OpenAI.
Sweet, sweet, sweet. Awesome.
Now, do you do any self-service flow steps with this, or is it all webhooks? That's one advantage of self-service flow steps: sometimes OpenAI can take a long time to send the information back to Marketo, and if it takes longer than 30 seconds, Marketo will mark the webhook as failed. So if you're seeing that quite often, that's where you should switch to self-service flow steps, because they circumvent that 30-second timeout limit. You send the payload to OpenAI, and OpenAI can take as long as it wants to send back the response to the self-service flow step, so you'll never run into that timeout issue. And it also gives you a lot more flexibility: if you want to map to multiple fields from the OpenAI response, like a hear source and a hear source detail for your attribution, you can only do that with a self-service flow step, because with the Marketo webhook configuration you can only map to a single field from the OpenAI response. Using a self-service flow step unlocks the ability to map the OpenAI response to multiple fields.
Got it, got it, got it. All right. Speaking of OpenAI, next question. How do you suggest someone get started? Obviously, you are not just getting started; you've been doing this for a long time, very much an expert. But let's say someone is green, they're a novice in this. Where would you recommend they get started? I'm assuming you probably have some more content to share with them. I do, yeah. There's a blog post in the resources for this presentation, which is called Integrating Marketo with ChatGPT. And that walks through the basics of where do I get my OpenAI API key that I need to use when I'm configuring the Marketo webhook. And then it sets you up by walking you through how to create that webhook, reference the OpenAI API key, and it uses the example we went through today of categorizing your contact sales form fills. So that's the place to start. It walks you through all the setup in your OpenAI account and then in Marketo to get started.
Great, great, great. Can you combine a couple of different data normalization operations using a single model, or do you recommend fine-tuning each model so that it's specific for each use case? I think specificity is usually better.
In order to answer this properly, I'd need more information on the sort of normalization you're trying to do. But in general, if you're trying to categorize your attribution data like the hear source, it's better to do that with a separate model than trying to combine it with one that does contact sales form fills or one that does the job titles. I think it's best to have a specific model for each one.
Got it, got it, got it. Now, have you utilized OpenAI to do any spam identification, so it's not passing bad fields or bad values over into your Marketo instance or over to your CRM? Yeah, it's similar to the contact sales example I walked through today, where you could get it to validate all the fields that have been submitted. And then, if you know there are certain things to look out for, you can prompt it to look for them and send back a flag so you can change them. I haven't done any specific use cases like that; I have done the one where I'm just trying to categorize, in general, is this a spam submission or a genuine one. But I'm glad someone asked the question, because it seems like they've started to think of ways that they could use OpenAI for this sort of task. But yeah, that's definitely something you can prompt it to look for when you do send it all your Marketo information: these are the red flags for certain fields, please highlight these for us so we can take a look.
Got it, got it, got it. How long did it take you to train your models using OpenAI to get this all ready to go? Yeah, the most intensive part of the fine-tuned models is just creating the data set, getting all the examples of, basically, these are the inputs and these are the desired outputs. It just takes time to manually go row by row and do all of that.
So I'd say for some things I've worked on, it would take me maybe an hour to two hours just to get all the data in the sheet. But then once you have the data, it's very easy to use the Python script or that Google Sheets process I showed in the video to upload it to OpenAI. And then it just takes a while to generate the model. So creating the model once you have the data set is very easy; the most manually intensive and time-consuming part is just saying, for all these inputs, this is the desired output.
Got it, got it, got it. Well, Tyran, thank you so much for answering our questions. We're going to end it there. We will see you soon. And if everyone has any more questions, Tyran's on LinkedIn; shoot him a message. He's very active there. Thanks, everyone. Appreciate it. See you, Tyran.
AI Use Cases for Data Categorization
- Spam Detection: AI models outperform CAPTCHA, reducing false positives/negatives and saving sales teams time.
- Persona Matching: AI accurately maps job titles (even with misspellings or in other languages) to personas, improving lead scoring and segmentation.
- Open Text Field Categorization: AI buckets diverse attribution sources, handling misspellings and languages, enabling richer insights and reporting.
- Customization: Fine-tuned models allow you to define rules and explanations for each categorization, giving you full control over outcomes.