Building a fully automated AI LinkedIn Research & Outreach Agent

November 13, 2025

The development of goal-oriented AI agents is driven by the need to bridge the gap between simple data retrieval and advanced reasoning. Traditional AI systems excel at retrieving and summarizing information but struggle with the contextual understanding and decision-making needed to turn raw data into insight, much as a chef turns basic ingredients into a unique dish. This evolution aims to enable AI to connect data points, contextualize information, and make informed decisions.

The transition from systems that “search + summarize” to those that comprehend context and intent marks a significant shift. Unlike their predecessors, these advanced systems can understand nuances such as slang and sarcasm, personalize responses, and maintain a natural conversational flow. This is achieved by leveraging large language models (LLMs) and contextual memory, which help map language patterns to probable user intent.

Project Summary

In this blog post, we develop a Streamlit application designed to enhance lead generation by processing natural-language descriptions of desired leads. The app returns prioritized LinkedIn profiles enriched with AI analysis and generates ready-to-send outreach emails. This is achieved through the integration of several key technologies.

Groq is employed for its intelligent keyword extraction and profile enrichment capabilities. Utilizing Groq’s AI acceleration technology, the app can efficiently score and analyze LinkedIn profiles while suggesting optimal outreach strategies. This ensures that the leads provided are not only relevant but also actionable.

For web scraping, Apify plays a crucial role. It performs web search scraping to identify LinkedIn profile URLs and uses a LinkedIn-specific actor to extract structured data from these profiles. This includes names, job titles, and other pertinent details that are essential for lead qualification.

The personalization of outreach emails is handled by a UBIAI-fine-tuned model. This model, already trained and deployed, crafts personalized emails tailored to each profile and the initial query, enhancing the likelihood of engagement and conversion.

Additionally, the app offers a feature to generate a blog post that summarizes the search process, findings, and outreach outcomes. This content can be downloaded in markdown format, providing a comprehensive overview of the lead generation efforts.

1. Introduction: Why an AI Research & Outreach Agent

The traditional approach to prospecting is hampered by its inefficiency, often consuming substantial time without yielding deep insights. 

Sales representatives report spending a significant portion of their time on non-selling tasks, highlighting the need for smarter, AI-driven solutions. Manual methods frequently result in superficial understanding, leading to the pursuit of unqualified leads, which can waste valuable resources.

AI-driven automation addresses these challenges by streamlining lead generation processes, allowing for both the creation of prospect lists and the enrichment of profiles. This automation is crucial, as companies using AI have seen a 50% increase in sales-ready leads and a notable reduction in sales-related costs.

Beyond mere data scraping, an AI research and outreach agent must extract context, assess relevance, and deliver personalized content that aligns with a brand’s voice. Generic outreach strategies are ineffective in today’s market, where hyper-personalization is necessary to engage potential clients. Personalized emails, for instance, result in six times more transactions and boast a 29% higher open rate. AI tools, such as AI Brand Voice Generators and platforms like Acrolinx, ensure that the messaging remains consistent with the brand’s unique style across all communication channels.

The primary goals of deploying an AI research and outreach agent include enhancing the efficiency of lead discovery, minimizing false positives, and generating high-quality, context-aware outreach drafts. AI can automate repetitive tasks, improving sales productivity and enabling a 10-15% boost in operational efficiency. By employing predictive analytics, AI reduces false positives and enhances conversion rates by focusing on high-potential prospects. Firms using AI report up to a 50% increase in lead generation and significant improvements in profit margins. However, businesses must be mindful of potential pitfalls such as over-reliance on AI and privacy compliance issues, which can be mitigated through careful data management and human oversight.

2. End-to-End Workflow Overview
High-level Pipeline Steps

User Inputs Natural Language Description: The process begins when a user enters a detailed natural-language description of their target leads, such as “CTOs in fintech, London, machine learning.” This leverages Natural Language Processing (NLP) to understand and analyze vast amounts of unstructured data, allowing businesses to identify potential leads and tailor marketing strategies accordingly.

Groq Extracts Focused Keywords: Groq uses advanced keyword analysis to distill the input description into focused search keywords. This deterministic extraction ensures that search queries are precise and consistent, utilizing statistical measures like TF-IDF and BM25 to prioritize relevant keyphrases.

Apify Web-Scraper Queries LinkedIn: Apify’s web-scraper executes site-specific queries on Google and Bing, targeting LinkedIn. It extracts profile links and usernames by searching for the most relevant LinkedIn pages, without requiring cookies, thus simplifying data collection.

Detailed Profile Data Extraction: For each extracted username, the Apify LinkedIn Scraper actor is invoked to gather structured profile data. This includes comprehensive information such as basic info, experience, education, and current company details, facilitating a thorough understanding of each lead.

Groq Analyzes Profiles: Groq evaluates each profile against the original query, assigning a numeric relevance score and providing a short analysis with an outreach suggestion. This JSON-only output is driven by AI and machine learning to enhance the accuracy and efficiency of the relevancy scoring process.

UBIAI Generates Personalized Outreach Emails: The UBIAI-fine-tuned model crafts personalized outreach emails for each profile. It uses the profile context and the initial search query as input, ensuring that the tone, style, and messaging remain consistent with the user’s brand voice.

Why this Sequence?

Keyword Extraction Upstream:

Implementing keyword extraction at the beginning of the workflow optimizes the search process by reducing irrelevant results and narrowing the scraping scope. By focusing on salient terms, this approach enhances the accuracy and efficiency of information retrieval. Techniques such as TF-IDF, linguistic analysis, and machine learning models are employed for keyword extraction. Recent advancements combine statistical methods with pre-trained embedding models, resulting in improved performance. For instance, a study on product attribute extraction demonstrated that fine-tuning with just 200 samples increased model accuracy from 70% to 88%.

Separation of Retrieval and Reasoning (Groq): In Retrieval-Augmented Generation (RAG) systems, separating retrieval from reasoning helps mitigate hallucinations and provides structured evidence for decision-making. This architecture enhances the system’s ability to contextualize information and connect multiple data points, reducing inaccuracies. The reasoning component evaluates the relevance of retrieved data, discarding irrelevant information. Orion Weller from Johns Hopkins University emphasizes the need for embedding instruction-following and reasoning directly into retrieval models to improve accuracy. By grounding responses in factual data, RAG systems significantly lower the chances of generating fabricated information.

Fine-tuning the Email Generator on UBIAI: Fine-tuning email generators on platforms like UBIAI ensures outputs maintain a consistent tone, brevity, and relevance, unlike generic prompt-based models. This process involves training a pre-existing language model on a task-specific dataset, aligning it with the desired communication style. UBIAI’s annotation tools aid in creating precise datasets, crucial for specialized tasks. The process includes collecting high-quality emails, structuring them consistently, and using tools like GPT-3.5 to convert emails into bullet points, which serve as inputs for training. This method allows for the generation of professional, tone-consistent emails tailored to specific needs.

3. Streamlit UI & Session Flow (what the code does)


Page and Sidebar Configuration

```python
import streamlit as st

# Page configuration
st.set_page_config(
    page_title="Deep Research Agent",
    page_icon=":mag:",
    layout="wide",
)

# Sidebar inputs
groq_api_key = st.sidebar.text_input("Groq API Key", type="password")
apify_api_token = st.sidebar.text_input("Apify API Token", type="password")
include_email = st.sidebar.checkbox("Include Email in Results")
max_profiles = st.sidebar.slider("Max Profiles", min_value=1, max_value=100, value=25)
st.sidebar.markdown("## Research Pipeline")
st.sidebar.write("This agent uses Groq's LLMs to process and analyze data efficiently…")

# Validation checks
if not groq_api_key:
    st.error("Groq API Key is required.")
    st.stop()
if not apify_api_token:
    st.error("Apify API Token is required.")
    st.stop()

# Styling for the sidebar
st.markdown(
    """
    <style>
    [data-testid="stSidebar"] {
        background-color: #f0f0f5;
        padding: 20px;
        border-right: 2px solid #ccc;
        color: #333;
    }
    </style>
    """,
    unsafe_allow_html=True,
)
```

Page Configuration:

Utilizes st.set_page_config() to set the page title to “Deep Research Agent” and the icon to a magnifying glass emoji. The layout is set to “wide” to accommodate results panels effectively.

Sidebar Inputs:

Includes password-type text inputs for Groq API Key and Apify API Token. A checkbox for “Include Email in Results” and a slider for “Max Profiles” are added for user customization. The sidebar also contains a markdown section explaining the research pipeline.

Validation Checks: Ensures the presence of API keys, displaying user-friendly errors and halting execution if keys are missing, using st.error() and st.stop().

Styling: Customizes the sidebar appearance with CSS for a polished look.

Chat Input and Conversation State

Streamlit’s st.chat_input is an essential component for building conversational interfaces, enabling users to input natural language queries such as ‘Describe the leads/profiles you want…’. This widget supports various customizations, including setting a placeholder text, defining a unique key, and limiting character input. You can also control the widget’s width and configure it to accept file uploads, with each file limited to 200 MB. However, it doesn’t support multiline placeholders and cannot be disabled during response generation by a language model.

The conversation history is maintained using st.session_state.messages, ensuring continuity across app sessions. This list of dictionaries stores each message, with keys like role (indicating the message author) and content (holding the text). Here’s how to initialize and utilize this feature:

```python
# Initialize conversation history once per session
if "messages" not in st.session_state:
    st.session_state.messages = []

# Replay prior messages on each rerun
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.write(message["content"])

# Appending a new user message
prompt = "Your input here"
st.session_state.messages.append({"role": "user", "content": prompt})
```

These practices are essential for maintaining a coherent chat experience. Proper initialization of st.session_state is crucial, and descriptive keys should be used for clarity. While st.session_state is effective for managing conversation state, avoid using it for large datasets, opting for st.cache_data instead.

Flow Control and Progress UX

Streamlit offers powerful tools for managing flow control and enhancing user experience, crucial for applications involving tasks like data scraping, enrichment, and email generation.

Progress Bars and Status Placeholders: The st.progress function visually represents real-time task advancement. It should be initialized with st.progress(0) and incrementally updated within a loop. For instance, updating a progress bar during a data scraping operation gives users a clear indication of task progress. Meanwhile, st.empty serves as a placeholder for dynamic status updates, replacing elements as tasks advance. These elements work together to offer clear and timely feedback to users.

Per-profile Expanders: The st.expander function provides a collapsible container to organize detailed information. This is particularly useful for displaying comprehensive data per profile, such as scores, analyses, outreach emails, and complete JSON data. Expanders maintain a clean interface, allowing users to inspect details only when needed, thereby preventing information overload.

Color-Coded Scores and Conditional UI Elements: Streamlit supports color customization using markdown syntax, which is useful for visualizing scores (e.g., green for high scores, orange for medium, red for low). Conditional UI elements enhance clarity, such as displaying a profile image only when available or showing an email address if included. This dynamic presentation helps users focus on relevant details, improving overall user experience.

4. Keyword Extraction with Groq (detailed)

Prompt Design and Examples

To effectively extract keywords using Groq, the prompt design must instruct the model to deliver only keywords, separated by spaces. This approach ensures a clean, parsable output devoid of unnecessary text. An example of this could be transforming the sentence “AI researcher from Oxford working on computer vision” into the keyword list “AI researcher Oxford computer vision.”

Incorporating multiple examples within the prompt serves to bias the model towards identifying specific types of keywords such as job titles, roles, skills, and institutions. This technique, known as few-shot learning, helps the model grasp the desired output format by showcasing diverse examples. For instance, examples should cover a range of job titles (e.g., “Data Scientist,” “Marketing Analyst”), skills (e.g., “Python,” “communication”), and institutions (e.g., “Stanford,” “Google”).
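The few-shot pattern described above can be sketched as a simple prompt builder; the example pairs below are illustrative, not taken from the actual app:

```python
# Illustrative few-shot pairs biasing the model toward titles, skills, institutions
FEW_SHOT = [
    ("AI researcher from Oxford working on computer vision",
     "AI researcher Oxford computer vision"),
    ("Data Scientist in Berlin skilled in Python and SQL",
     "Data Scientist Berlin Python SQL"),
]

def build_keyword_prompt(description: str) -> str:
    # Instruction first, then examples, then the new input to complete
    lines = ["Extract only search keywords, separated by spaces. Output nothing else."]
    for text, keywords in FEW_SHOT:
        lines.append(f"Input: {text}\nKeywords: {keywords}")
    lines.append(f"Input: {description}\nKeywords:")
    return "\n\n".join(lines)
```

The assembled string is then sent as the user message in the chat-completion call.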

To achieve deterministic outputs, temperature and token settings are pivotal. A low temperature setting, such as 0.2, reduces randomness in the model’s predictions, making the output more predictable and focused. Although setting the temperature to 0 aims to eliminate randomness, it doesn’t guarantee absolute determinism due to potential hardware and precision limitations. The max_tokens parameter, set to 150, confines the length of the output, ensuring brevity and relevance in the keyword list. These settings are essential for tasks requiring accuracy and consistent results, especially in professional contexts like startups and marketing.

Post-processing Rules

For effective keyword extraction using Groq, post-processing rules are essential to create clean and precise search queries. These rules involve normalizing whitespace, stripping markdown characters, truncating output, and ensuring verbatim usage in search URLs for Apify.

Normalize Whitespace and Strip Markdown Characters Using Regex: Whitespace normalization involves condensing multiple spaces, tabs, and other whitespace characters into single spaces, while removing leading and trailing spaces. This is crucial for consistent query formatting. For instance, using Python’s regex module:

```python
import re

text = "This  sentence has   extra whitespaces.  "
normalized_text = re.sub(r"\s+", " ", text).strip()
```

Markdown stripping removes characters like *, #, and !, which could interfere with search engines. Regex is effective for this task, ensuring queries are free from formatting noise that markdown introduces.
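A minimal regex-based stripper along these lines might look as follows; the exact character set to remove is an assumption and should be tuned to the model's actual output:

```python
import re

def strip_markdown(text: str) -> str:
    # Remove common markdown formatting characters, then re-normalize whitespace
    cleaned = re.sub(r"[*#_`>\[\]()!]", "", text)
    return re.sub(r"\s+", " ", cleaned).strip()
```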

Truncate Output to a Maximum of 10 Words: To prevent overly broad searches that yield irrelevant results, the output is truncated to a maximum of 10 words. This focuses the search on the most pertinent aspects of the content. In Python, a simple implementation is:

```python
text = "This is a long string of keywords that needs to be truncated"
truncated_text = " ".join(text.split()[:10])
```

This approach ensures that the extracted keywords remain targeted and relevant.

Return Value Used Verbatim in Search URLs for Apify: The post-processed keywords are used verbatim in the search URLs for Apify, a platform for web scraping and data extraction. Using the exact keywords minimizes variability and ensures consistent search results. For example, a keyword like “best AI tools for marketing” would be encoded into a URL as https://apify.com/web-search?q=best%20AI%20tools%20for%20marketing, with %20 representing spaces. This verbatim usage is crucial for accurate data retrieval by Apify’s API.
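Percent-encoding the keyword string for a search URL can be done with the standard library; the endpoint below simply mirrors the example URL above:

```python
from urllib.parse import quote

keywords = "best AI tools for marketing"
# quote() percent-encodes each space as %20, leaving letters untouched
search_url = f"https://apify.com/web-search?q={quote(keywords)}"
```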

Rationale

Providing curated examples and enforcing a strict output format in keyword extraction with Groq significantly enhances the process by reducing noise and improving the reliability of downstream scraping. By using high-quality datasets that are diverse, representative, and free from errors, the extracted keywords become more relevant and accurate. This is crucial for applications such as content tagging, search engine optimization, and document indexing, where precision is paramount. Additionally, controlling parameters like temperature and top-p can help manage the randomness and diversity of generated text, ensuring consistent results.

This approach also serves as a lightweight semantic parser that effectively maps user intent into actionable search phrases. Unlike traditional keyword-based search engines, Groq, in combination with AI-native search engines like Exa, understands the semantic meaning and context of queries. This capability allows for the development of intelligent search applications that can break down human language into components machines can analyze. By focusing on semantic relations and contextual understanding, keyword extraction moves beyond simple frequency counts to uncover the most relevant information from text.

In practice, these techniques enable organizations to better understand customer needs and preferences, translating user intent into business actions. For example, during user experience interviews, keyword extraction can identify prominent themes and sentiments, providing valuable insights into customer voices. By aligning keyword extraction with semantic analysis, businesses can achieve a more structured and meaningful interpretation of user data.

5. Searching for LinkedIn Profiles (Apify web-scraper)
Search Strategy

To effectively search for LinkedIn profiles using Google and Bing, a structured approach leveraging the site:linkedin.com/in operator and Groq-generated keywords is essential. This method confines your search to LinkedIn profile pages, ensuring relevant results. By combining this operator with specific keywords such as job titles, skills, and locations, you refine your search to target the profiles of interest. Boolean operators enhance precision: AND narrows results to include all terms, OR broadens them to any term, and NOT excludes specific terms. For example, using site:linkedin.com/in “SEO Consultant” “San Diego” will locate LinkedIn profiles of SEO Consultants based in San Diego.
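A small helper, under the assumption that the extracted keywords arrive as a list, might assemble such a query:

```python
def build_search_query(keywords, location=None):
    # The site: operator confines results to LinkedIn profile pages
    parts = ["site:linkedin.com/in"] + [f'"{kw}"' for kw in keywords]
    if location:
        parts.append(f'"{location}"')
    return " ".join(parts)

query = build_search_query(["SEO Consultant"], location="San Diego")
# → site:linkedin.com/in "SEO Consultant" "San Diego"
```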

Starting searches on both Google and Bing increases the scope and diversity of results, as each search engine may index LinkedIn profiles differently. This dual approach maximizes the chances of capturing a wide array of relevant profiles.

Limiting crawl depth to maxCrawlDepth=0 is pivotal for web scraping, as it restricts the scraper to only the search results pages without delving into individual profiles. This strategy efficiently gathers a broad spectrum of profiles, focusing on quantity rather than detailed data from individual pages.

To prioritize results from the United States, incorporate location-specific keywords like “United States” or “USA” in your queries. With a significant LinkedIn user base in the U.S.—230 to 252 million as of April 2025—targeting this demographic is strategic for accessing a substantial pool of potential connections.

Apify Actor Configuration (pageFunction)

The pageFunction in Apify is crucial for extracting LinkedIn profile data. By utilizing jQuery, the function efficiently identifies anchor tags (<a>) that contain ‘linkedin.com/in’, ensuring accurate data retrieval. The jQuery selector $(‘a[href*=”linkedin.com/in”]’) targets these anchors to extract URLs pointing to LinkedIn profiles.

Handling Google redirect URLs is another critical aspect. These URLs often include a url parameter that needs extraction and decoding to reveal the actual LinkedIn profile link. The function below demonstrates how to achieve this:

```javascript
function extractLinkedInUrl(googleRedirectUrl) {
    // URLSearchParams alone cannot parse a full URL, so go through URL first
    const linkedInUrl = new URL(googleRedirectUrl).searchParams.get('url');
    return linkedInUrl ? decodeURIComponent(linkedInUrl) : null;
}
```

Once the LinkedIn URL is obtained, a regular expression is employed to extract the username. The regex /linkedin\.com\/in\/([a-zA-Z0-9\-\_]+)/ captures the username from the URL, allowing the function to return structured data:

```javascript
function extractUsername(url) {
    const match = url.match(/linkedin\.com\/in\/([a-zA-Z0-9_-]+)/);
    return match ? match[1] : null;
}
```

The pageFunction typically returns an object containing the URL, username, and additional information like the profile title:

```javascript
{
    url: "https://www.linkedin.com/in/example-user",
    username: "example-user",
    title: "Software Engineer at Example Corp"
}
```

To operate within Apify’s plan and LinkedIn’s rate limits, the maxRequestsPerCrawl and maxConcurrency settings are adjusted conservatively. This approach helps avoid IP blocking and ensures compliance with usage restrictions. For example:

```javascript
const crawler = new CheerioCrawler({
    maxRequestsPerCrawl: 500,
    maxConcurrency: 20,
    // … other options
});
```

Using proxies and managing sessions are additional strategies to prevent scraping detection. Apify provides tools like ProxyConfiguration and SessionPool to support these needs. As noted by experts, “using a proxy and rotating the IPs is essential to let the crawler run as smoothly as possible and avoid being blocked.”

Post-processing & Deduplication

Efficient post-processing and deduplication are crucial when scraping LinkedIn profiles using the Apify web scraper, especially to handle large datasets without redundancy.

Streaming Dataset Items with client_apify.dataset(…).iterate_items(): This approach processes data incrementally, crucial for managing large datasets typical in LinkedIn scraping. By streaming data item-by-item, rather than loading it all at once, memory usage is optimized. For example, when scraping thousands of LinkedIn profiles for marketing leads, iterate_items() allows you to filter and process profiles based on criteria such as job titles or locations, like “marketing manager” or “San Francisco,” without memory overload.

Deduplication Using a seen_usernames Set: To avoid redundant scraping and outreach, maintain a seen_usernames set. This set tracks processed profiles, ensuring each username is only processed once. Before processing a profile, check if its username is in the set. If not, process and add it to the set; if it is, skip the profile. This method saves on Apify compute units, reduces the risk of LinkedIn rate limits, and provides cleaner data. For instance, before sending a connection request, verify if the username is already in the set to avoid duplication.

Stopping at max_profiles: Implement a mechanism to halt scraping after collecting the desired number of profiles, defined by max_profiles. This involves maintaining a counter that increments with each processed profile. Before scraping a new profile, check if the counter has reached max_profiles and stop if it has. This practice controls costs, prevents irrelevant data collection, and adheres to ethical scraping limits. For example, when targeting 500 profiles of software engineers in the US, the process stops once 500 profiles are collected, even if more are available.
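The dedup-and-cap logic described above can be sketched as a plain Python filter; field names follow the earlier pageFunction output, and in the real app the iteration source would be the streamed Apify dataset:

```python
def collect_unique_profiles(items, max_profiles):
    seen_usernames = set()
    results = []
    for item in items:
        username = item.get("username")
        if not username or username in seen_usernames:
            continue  # skip duplicates and malformed rows
        seen_usernames.add(username)
        results.append(item)
        if len(results) >= max_profiles:
            break  # stop once the requested number is collected
    return results
```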

6. Scraping LinkedIn Profiles (LinkedIn Scraper actor)

Actor Inputs and Expected Output Shape

The LinkedIn Scraper actor is designed to extract detailed profile data from LinkedIn using specific input parameters. The primary inputs required to run this actor are:

username: This refers to the unique identifier found in a LinkedIn profile URL (e.g., linkedin.com/in/username). It is essential for targeting a specific profile for data extraction.

includeEmail: As a boolean parameter, this dictates whether the scraper attempts to extract the email address from the profile. While some email addresses may be publicly available, others might require additional techniques to uncover, and it’s important to note the privacy implications of scraping email data.

The output generated by the scraper is a structured JSON object containing several fields within profile_data. These fields are categorized as follows:

basic_info: This section includes:

fullname: The person’s full name.

headline: A brief professional description.

about: Text from the “About” section of the profile.

location: Geographical area of the person.

current_company: Name of the current employer.

email: The email address, if available and requested.

is_influencer: Indicates if the person is recognized as an influencer.

follower_count: Number of followers the profile has.

profile_picture_url: Direct URL to the profile picture.

experience: A list of job roles, each entry containing:

title: Job title.

company_name: Employer name.

location: Job location.

duration: Employment period.

start_date and end_date: Job start and end dates.

description: Job role description.

education: A list of educational qualifications, with each entry detailing:

school_name: Educational institution name.

degree_name: Degree earned.

start_date and end_date: Duration of education.

This JSON output format is tailored for direct consumption in subsequent enrichment processes, often used by marketing teams to enhance and verify data accuracy.
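Put together, the actor input and an abridged output shape look roughly like this; all values are invented for illustration:

```python
# Actor input (per the parameters above)
run_input = {"username": "example-user", "includeEmail": True}

# Abridged output shape; real runs return additional fields
profile_data = {
    "basic_info": {
        "fullname": "Jane Doe",
        "headline": "CTO at Example Fintech",
        "location": "London, United Kingdom",
        "current_company": "Example Fintech",
        "email": "jane@example.com",
        "is_influencer": False,
        "follower_count": 1200,
        "profile_picture_url": "https://example.com/jane.jpg",
    },
    "experience": [
        {
            "title": "CTO",
            "company_name": "Example Fintech",
            "location": "London",
            "duration": "2 yrs",
            "start_date": "2023",
            "end_date": "Present",
            "description": "Leads the ML platform team.",
        },
    ],
    "education": [
        {
            "school_name": "University of Oxford",
            "degree_name": "MSc Computer Science",
            "start_date": "2014",
            "end_date": "2015",
        },
    ],
}
```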

7. Enrichment & Analysis with Groq (exact prompt contract)

Profile Context Construction

Profile context construction involves aggregating diverse data points into a comprehensive dictionary, providing a detailed snapshot of an individual’s professional background. This process is particularly valuable for startups, marketing professionals, and AI engineers seeking to understand target audiences or potential hires.

Data Aggregation into a Dictionary:

Key fields include username, linkedin_url, fullname, headline, about, location, current_company, email, is_influencer, and follower_count.

LinkedIn is a primary source for these data points due to its extensive professional networking focus.

Data is enriched by integrating behavioral data, third-party information, and predictive scoring to enhance profile accuracy.

Experience and Education Entries:

To maintain compactness, the top three experience entries and top two education entries are selected, with descriptions truncated.

This truncation minimizes computational resources and highlights the most relevant information.

For instance, a profile might list the most recent job roles and educational achievements, focusing on key responsibilities and accomplishments.
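The trimming described above might look like this in Python; the 200-character description cap is an illustrative choice, not the app's exact value:

```python
def build_profile_context(profile_data):
    # Start from basic_info, then attach trimmed experience/education lists
    context = dict(profile_data.get("basic_info", {}))
    context["experience"] = [
        {**entry, "description": (entry.get("description") or "")[:200]}
        for entry in profile_data.get("experience", [])[:3]  # top 3 roles
    ]
    context["education"] = profile_data.get("education", [])[:2]  # top 2 schools
    return context
```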

Technical Details:

Profile data is commonly formatted in JSON, facilitating easy integration into applications.

LinkedIn APIs, such as GET https://api.linkedin.com/v2/me, provide structured access to user data, though they require authentication and adhere to specific usage guidelines.

Third-party APIs like Bright Data and Lix API offer alternative data extraction methods, managing tasks like proxy rotation and rate limiting.

Profile Enrichment Importance:

Enriching profiles with additional data points beyond basic resumes enhances customer engagement by offering a holistic view of individuals.

Detailed data aids in targeted marketing strategies, leading to improved business outcomes.

Considerations and Best Practices:

Emphasize data accuracy, validation, and freshness to ensure reliable and up-to-date profiles.

Adhere to data privacy regulations and LinkedIn’s terms of service to avoid legal issues.

Enrichment Prompt Structure

Creating an effective enrichment prompt structure with Groq for lead generation analysis involves several critical components:

Role Assignment: Clearly define Groq’s role as “an expert lead generation analyst.” This establishes the model’s function and ensures it aligns with your objectives. As noted in GroqDocs, prompt priming sets the “temperature of the conversation room,” and guides the model’s behavior.

JSON-Only Output: Specify that the output must be in JSON format to ensure structured, predictable results. JSON’s versatility facilitates integration with other applications. According to Google AI for Developers, instructing the model to return only valid JSON with specific keys, such as score, analysis, and outreach_suggestion, is essential. This format reduces error rates and supports seamless integration.

Temperature and Token Settings: Use a temperature setting of 0.3 to prioritize accuracy and relevance over creativity, as suggested by Groq Chat Settings. Allow approximately 400 tokens to provide sufficient space for detailed reasoning and analysis without exceeding the model’s context length.

Scoring and Analysis: Assign a numerical score (1-100) to the lead based on their profile and the search query. The analysis should be concise, highlighting strengths and weaknesses in 2-3 sentences. Finally, suggest an outreach strategy in 1-2 sentences, tailored to the lead’s profile, to enhance sales effectiveness.

This structured approach ensures Groq provides a focused, actionable analysis of lead potential, aiding in streamlined lead qualification and boosting sales outcomes.
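A prompt template following this contract might look like the sketch below; the exact wording used in the app may differ:

```python
# Hypothetical enrichment prompt implementing the JSON-only contract above
ENRICHMENT_PROMPT = """You are an expert lead generation analyst.
Evaluate the candidate profile against the search query and return ONLY valid JSON
with exactly these keys:
  "score": an integer from 1 to 100 rating relevance,
  "analysis": 2-3 sentences on strengths and weaknesses,
  "outreach_suggestion": 1-2 sentences tailored to this lead.

Search query: {query}

Profile:
{profile_json}"""

prompt = ENRICHMENT_PROMPT.format(
    query="CTOs in fintech, London, machine learning",
    profile_json='{"fullname": "Jane Doe", "headline": "CTO at Example Fintech"}',
)
```

The request is then sent with the settings described above (temperature 0.3, roughly 400 tokens).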

Parsing & Fallbacks

When dealing with JSON responses from Groq, it’s important to strip any enclosing triple-backtick fences before parsing. These fences often appear in structured outputs or code snippets returned by the API. The Python function json.loads() is typically used to convert JSON strings into Python objects. For instance:

```python
import json

response_text = "```json\n{\"key\": \"value\"}\n```"  # Example response from Groq

# Remove triple-backtick fences before parsing
json_string = response_text.replace("```json", "").replace("```", "").strip()

try:
    data = json.loads(json_string)
    print(data)
except json.JSONDecodeError as e:
    print(f"Error decoding JSON: {e}")
```

If parsing fails, or if there’s an error from Groq, implementing a fallback mechanism is essential. This involves setting a default object with a score of 50 and a note for manual review. Here’s how you can handle these errors:

```python
import json
import groq

client = groq.Groq()

try:
    response = client.chat.completions.create(  # Groq API call
        messages=[{"role": "user", "content": "Extract data"}],
        model="example-model",
    )
    json_string = response.choices[0].message.content
    data = json.loads(json_string)
    score = data.get("score")  # Extract score if present
except (json.JSONDecodeError, groq.APIError) as e:
    print(f"Error: {e}")
    data = {"score": 50, "review": "Manual review needed due to parsing or API error."}
    score = data["score"]

print(data)
print(score)
```

To mitigate rate-limit issues, introduce a short delay between API calls. Groq’s API employs rate limits, and exceeding them results in a 429 Too Many Requests error. A delay of 0.5 seconds can help avoid this:

```python
import time
import groq

client = groq.Groq()

for i in range(10):
    try:
        chat_completion = client.chat.completions.create(
            messages=[{"role": "user", "content": "Tell me something"}],
            model="example-model",
        )
        print(chat_completion.choices[0].message.content)
    except groq.APIError as e:
        print(f"API Error: {e}")
    time.sleep(0.5)  # Delay to prevent rate limiting
```

These practices ensure efficient error handling and compliance with API rate limits, enhancing the reliability of your application.
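A fixed delay works for steady traffic, but if a 429 still slips through, retrying with exponential backoff is a common refinement. The sketch below is an illustrative pattern, not part of the app's original code; the flaky stand-in function simulates a rate-limited API call:

```python
import time

def call_with_backoff(make_request, max_retries=5, base_delay=0.5):
    """Retry a zero-argument callable with exponentially increasing delays.
    In practice, catch groq.APIError rather than bare Exception."""
    for attempt in range(max_retries):
        try:
            return make_request()
        except Exception as e:
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt)  # 0.5s, 1s, 2s, ...
            print(f"Attempt {attempt + 1} failed ({e}); retrying in {delay}s")
            time.sleep(delay)

# Example with a flaky stand-in for the API call
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("429 Too Many Requests")
    return "ok"

result = call_with_backoff(flaky, base_delay=0.01)
print(result)  # "ok" after two retries
```

Backoff pairs well with the fixed inter-call delay shown above: the delay prevents most 429s, and the retry absorbs the ones that still occur.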

8. Email Generation Using the UBIAI Fine-Tuned Model

Our project leveraged UBIAI’s fine-tuning platform to transform a generic instruction-following model into a highly specialized email generation system capable of producing brand-consistent, concise, and personalized outreach messages. By fine-tuning on a carefully constructed dataset of real and synthetic outreach examples, the model learned to understand context from profile data and craft tailored emails under specific constraints—such as tone, word limit, and call-to-action style.

UBIAI enabled parameter-efficient fine-tuning through methods like Adapters and QLoRA, allowing us to optimize performance without extensive computational resources. We configured and trained the model with a balanced mix of automated evaluation (e.g., BERTScore) and human-in-the-loop testing to ensure alignment with communication goals. The result was a robust, deployable model that could be accessed via UBIAI’s inference API and integrated seamlessly into a Streamlit app for real-time, safe, and compliant email generation.

In essence, UBIAI served as the core engine that empowered us to convert data-driven insights into an adaptable, production-ready outreach model—one that blends efficiency, personalization, and brand authenticity at scale.
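Integration with the Streamlit app comes down to building a request for the deployed model. The payload field names, parameter values, and endpoint in this sketch are hypothetical; consult UBIAI's API documentation for the actual schema:

```python
import json

def build_email_request(profile: dict, query: str, max_words: int = 120) -> dict:
    """Assemble a request body for the fine-tuned email model.
    All field names here are illustrative assumptions."""
    return {
        "input": (
            f"Write a personalized outreach email (max {max_words} words) "
            f"for this lead, given the search intent.\n"
            f"Query: {query}\nProfile: {json.dumps(profile)}"
        ),
        "max_tokens": 300,
        "temperature": 0.7,
    }

payload = build_email_request(
    {"name": "Jane Doe", "title": "VP of Sales", "company": "Acme"},
    "SaaS sales leaders in fintech",
)
print(json.dumps(payload, indent=2))

# The actual call would be something along the lines of (URL and auth
# header are placeholders):
# requests.post(UBIAI_INFERENCE_URL, json=payload,
#               headers={"Authorization": f"Bearer {API_KEY}"})
```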

9. Data Privacy, Ethics & Compliance

Privacy Measures Taken

To safeguard personal data and ensure compliance with privacy regulations, several key measures have been implemented.

Sanitized Training Data: Training data is sanitized by removing or anonymizing personally identifiable information (PII) unless explicit consent is obtained. Techniques include data masking, swapping, generalization, suppression, pseudonymization, tokenization, noise addition, and synthetic data generation. For instance, noise addition via differential privacy adds mathematically calibrated noise to datasets, preventing re-identification. Synthetic data generation has been shown to improve AI accuracy by 15% compared to anonymized real data, as demonstrated in a healthcare AI model that achieved 92% accuracy.
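The masking step can be sketched with simple regexes for emails and phone numbers. Real pipelines typically use NER-based PII detectors; the patterns below are illustrative, not exhaustive:

```python
import re

# Minimal PII-masking sketch: replace emails and phone numbers with tags.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def mask_pii(text: str) -> str:
    text = EMAIL_RE.sub("[EMAIL]", text)  # mask emails first
    text = PHONE_RE.sub("[PHONE]", text)  # then phone-like digit runs
    return text

sample = "Reach Jane at jane.doe@acme.com or +1 (555) 123-4567."
print(mask_pii(sample))
```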

Minimal Audit Logging: Audit logs are maintained with minimal fields such as username, URL, and generated email, ensuring privacy while enabling audits. Access to these logs is controlled to prevent unauthorized modifications, with optional hashing providing an additional security layer. For example, HIPAA compliance in the healthcare sector necessitates specific audit trail requirements, including tracking logins and data modifications, with encrypted storage and automated monitoring as best practices.
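The optional-hashing idea can be sketched as follows: store only a SHA-256 digest of the generated email, so the log can prove what was sent without retaining the content itself. The field choices are illustrative assumptions:

```python
import hashlib
import json
import time

def audit_entry(username: str, url: str, generated_email: str) -> dict:
    """Build a minimal audit-log record with a hashed email body.
    Field names are illustrative, not a fixed schema."""
    return {
        "timestamp": time.time(),
        "username": username,
        "url": url,
        # SHA-256 digest stands in for the email text itself
        "email_sha256": hashlib.sha256(
            generated_email.encode("utf-8")
        ).hexdigest(),
    }

entry = audit_entry(
    "analyst1",
    "https://www.linkedin.com/in/example",
    "Hi Jane, I noticed your work at Acme...",
)
print(json.dumps(entry))
```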

Opt-Out/Deletion Workflow: An opt-out and deletion workflow is established in line with privacy laws like CCPA/CPRA and GDPR. These regulations allow individuals to request the deletion of their personal data, with deletion requests accounting for 39.6% of data subject requests under CCPA in early 2020. The California Delete Act mandates data brokers to process deletion requests through the Delete Request and Opt-out Platform (DROP) by 2026, addressing the cumbersome process of individual opt-outs.
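At the application level, a deletion workflow reduces to removing the lead's record and logging that the request was honored. This is a minimal sketch with an in-memory dict standing in for whatever datastore the app uses:

```python
# Stored leads keyed by profile URL (stand-in for a real datastore)
leads = {
    "https://www.linkedin.com/in/example": {"name": "Jane Doe", "score": 82},
}
deletion_log = []

def handle_deletion_request(profile_url: str) -> bool:
    """Delete a lead's data and log the request; return True if found."""
    existed = profile_url in leads
    leads.pop(profile_url, None)
    # Log only the URL and outcome, not the deleted content
    deletion_log.append({"url": profile_url, "deleted": existed})
    return existed

ok = handle_deletion_request("https://www.linkedin.com/in/example")
print(ok, len(leads))
```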

These measures reflect a commitment to privacy and compliance, addressing challenges like re-identification risk and balancing data utility with privacy.

10. Final Notes and Cautions

Key Warnings

When engaging in scraping and automated outreach, startups and marketing professionals must be aware of the legal and platform-risk implications. In the United States, scraping isn’t regulated by a single law but rather by a patchwork of statutes like the Computer Fraud and Abuse Act (CFAA) and the Digital Millennium Copyright Act (DMCA). Scraping publicly available data is typically permissible, but collecting personal data without consent or breaching a website’s terms of service can lead to legal challenges. For instance, violating terms of service may result in being blocked or facing legal action, as seen in the HiQ Labs v. LinkedIn case. To stay compliant, respect robots.txt files, apply rate limiting, and avoid collecting personal data.
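Respecting robots.txt can be automated with Python's standard library. The rules below are parsed inline for illustration; in production, RobotFileParser can fetch the live file with set_url() and read():

```python
from urllib import robotparser

# Example robots.txt rules, parsed inline for the sketch
rules = """User-agent: *
Disallow: /private/
Allow: /""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# Check whether a generic crawler may fetch each URL
print(rp.can_fetch("*", "https://example.com/public/page"))
print(rp.can_fetch("*", "https://example.com/private/page"))
```

Combined with the rate-limiting delay shown earlier, this covers two of the three compliance practices named above; the third, avoiding personal-data collection, is a data-handling policy rather than a code check.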
