How to Build a Google Scraping AI Agent with n8n (Step-by-Step Tutorial)

Overview

This document outlines the process of building a Google scraping AI agent using n8n, OpenAI API, and Google Sheets. The agent is designed to search Google for specific LinkedIn profile URLs (e.g., CEOs in real estate in Chicago) and populate these URLs into a Google Sheet. The solution involves two main n8n workflows: a "tool" workflow that handles the Google scraping and data parsing, and an "agent" workflow that interacts with the user and calls the "tool" as needed.

Key Concepts

1. Automated Data Scraping for Targeted Information

The core functionality of this system is to automate the extraction of highly specific data (LinkedIn profiles) from Google search results. This is achieved by crafting precise Google search queries that filter results to only include LinkedIn profiles (site:linkedin.com/in).

  • "The agent... will go search Google and scrape Google to get those profiles into a sheet like this."

  • The agent successfully finds "CEOs in real estate in Chicago" and "Founders in technology in San Francisco" from LinkedIn.
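
For instance, the search behind the first demo request might look like the query below. The exact phrasing the workflow assembles is an assumption; the site: operator itself is standard Google syntax.

```
site:linkedin.com/in CEO real estate Chicago
```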

2. Leveraging n8n for Workflow Automation

n8n serves as the central platform for building and orchestrating the entire process. It provides the visual workflow builder, pre-built nodes for integrations (OpenAI, Google Sheets, HTTP requests), and the ability to define custom tools and agents.

  • "n8n that is the website that we're building the workflows on for the tool and for the agent."

  • "We're building two workflows today the first one is going to be the tool that scrapes Google for the URLs and then the second one is the actual agent that we can interact with."

3. AI Integration (OpenAI) for Intelligent Processing and Interaction

OpenAI's API is crucial for two main purposes:

  • Parsing User Queries: An OpenAI node in the "tool" workflow parses natural language requests from the user (e.g., "CEOs in real estate in Chicago") into structured parameters (job title, company, industry, location) for the Google search; a sketch of this step follows below.
  • Agent Intelligence: An OpenAI chat model (GPT-4o recommended) acts as the "brain" of the AI agent, allowing it to understand user requests, decide which tools to use, and formulate appropriate responses.
  • "We need to connect an open AI node... can you parse the Json query which is this information over here and then output it as the following parameters separately."

  • "We need to give him the brain which is a chat model which we always use open AI chat model."

4. Tool-Based AI Agent Architecture

The system employs a "tools agent" model in n8n. The main AI agent doesn't directly perform the scraping but rather calls a separate workflow (the "grab profiles" tool) designed specifically for that task. This modular approach enhances maintainability and scalability.

  • "It's going to be a tools agent because it's going to be accessing the tools that we give it."

  • The tool has a "description of when to be called: call this tool to get LinkedIn profiles."

5. Handling Complex Web Data (HTML Parsing)

Google search results are returned as a large chunk of HTML. A key challenge is to extract only the desired LinkedIn URLs from this raw HTML. This is solved using a "code node" in n8n, with the actual parsing logic generated by an AI assistant.

  • "This is just a ton of HTML huge chunk of nonsense that no one you know can really interpret."

  • "The next node is going to be a code node and this is going to parse the information that we're getting this nasty chunk of HTML and it's going to parse it to only return what we want which are the LinkedIn profile URLs."

6. Utilizing AI Assistants for Code Generation and Troubleshooting

A significant innovation highlighted is the use of an n8n assistant GPT (a custom ChatGPT model) to generate the necessary code for parsing HTML and setting up HTTP request headers. This greatly simplifies development for users without extensive coding knowledge.

  • "I came into chat gbt here and on the left hand side you've got explore gbt... I just typed in NN and it brought me to um basically an n8n assistant GPT."

  • "I basically asked like can you help me set up header for HTTP request node... and it's really helpful."

  • "Can you write code to parse out the LinkedIn information we're looking for here is the here is an example chunk of HTML being returned."

7. Google Sheets as a Database

Google Sheets serves as the final destination for the scraped data, providing a simple, accessible database for the LinkedIn profile URLs.

  • "Google Sheets... that's like I said the database where we're storing the URLs into."

  • "Now we need to get these URLs into our Google sheet so we're going to do that with a Google Sheets node."

8. Memory and Context for Conversational Agents

The AI agent incorporates "window buffer memory" to retain context from previous interactions, enabling more natural and coherent conversations with the user.

  • "This memory is just giving the agent context for what we're talking about so it's not just like a question and answer and then his brain is resetting."

9. Limitations and Potential Solutions (Google Search API)

The current method of directly scraping Google via HTTP requests has limitations, particularly the number of results returned (typically around 10) and potential CAPTCHAs from Google. The speaker suggests that a dedicated Google search API (such as SerpAPI) would be necessary for more extensive scraping or pagination beyond the first page.

  • "If you guys are wondering sort of why we're only returning 10 responses and if you wanted to do more how could that work so my understanding is that we would need to be using a a an actual Google search API like Sur API in order to do that."

  • "Google's may be providing a capcha to to find out if it's a human or if it's really a robot scraping information."

---

Core Components and Setup Requirements

  • n8n: The primary platform for building workflows.
  • OpenAI API Key: Required for message parsing and AI agent intelligence.
  • Google Sheets: To store the scraped LinkedIn profile URLs.
  • Google Cloud Service Account (for Google Sheets): Needed for n8n to authenticate and write to Google Sheets (requires OAuth consent screen setup).

---

Workflow Breakdown

Workflow 1: "Grab Profiles" (The Tool)

  • When Called by Another Workflow Trigger: Initiates the workflow when the AI agent calls it, passing parameters like job title, company, industry, and location.
  • OpenAI Node (Message a Model):
    • System Prompt: Instructs the model to parse the incoming query (JSON format) and extract job title, company, industry, and location as separate parameters.
    • User Message: Provides the actual query data to be parsed.
    • Output: Ensures the parsed content is a string using JSON.stringify().
  • HTTP Request Node:
    • Method: GET (to retrieve information).
    • URL: Simple Google search URL.
    • Query Parameter q: Constructs the search query by combining site:linkedin.com/in with the parsed job title, industry, and location (see the sketch after this list).
    • Headers: User-Agent (set to a browser-like string to avoid being blocked by Google). This information was obtained using the n8n assistant GPT.
  • Code Node:
    • Parses the raw HTML response from the HTTP request node.
    • Extracts only the LinkedIn profile URLs. The code for this step was generated and refined with the help of the n8n assistant GPT.
  • Google Sheets Node (Append Row):
    • Connects to a pre-existing Google Sheet.
    • Appends the extracted LinkedIn URLs to a specified column (e.g., "Prospect LinkedIn URL"). Requires Google Cloud service account setup for authentication.
  • Set Node:
    • Sets a field named response with the value done. This signals to the calling AI agent that the tool's task is complete, allowing the agent to respond to the user.
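
For reference, here is a plain-JavaScript sketch of what the HTTP Request node is doing. The parameter names are assumptions and should match whatever the OpenAI parsing node actually outputs; the User-Agent string is just an example of a browser-like value.

```js
// Equivalent of the HTTP Request node: a GET to Google search with a
// browser-like User-Agent so the request is less likely to be blocked.
async function searchGoogle({ jobTitle, industry, location }) {
  const q = `site:linkedin.com/in ${jobTitle} ${industry} ${location}`;
  const res = await fetch(
    `https://www.google.com/search?q=${encodeURIComponent(q)}`,
    {
      headers: {
        'User-Agent':
          'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' +
          '(KHTML, like Gecko) Chrome/124.0 Safari/537.36',
      },
    }
  );
  return res.text(); // raw HTML, handed to the Code node for parsing
}
```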

Workflow 2: "AI Agent" (The Conversational Interface)

  • Chat Message Trigger: Initiates the workflow when a user sends a message to the agent.
  • AI Agent Node (Tools Agent):
    • Prompt: Takes input from the previous node (user's chat message). Can include a system message (e.g., "helpful assistant").
    • Chat Model (Brain): Connected to an OpenAI chat model (GPT-4o).
    • Memory: Configured with "Window Buffer Memory" to maintain conversational context.
    • Tools:
      • Workflow Tool: References the "Grab Profiles" workflow (the tool).
      • Tool Description: "Call this tool to get LinkedIn profiles" – guides the agent on when to invoke this workflow.
      • Response Field: Set to response to receive the done signal from the tool.

---

Demo and Validation

The agent was successfully tested with various queries:

  • "CEOs in real estate in Chicago"
  • "Founders in technology in San Francisco"

The system accurately identified and populated the correct LinkedIn profiles into the Google Sheet, demonstrating its dynamic functionality beyond hard-coded variables. Execution logs in n8n confirm the flow of data and the parsing steps.

Google Scraping AI Agent: FAQ

What is the primary purpose of the Google scraping AI agent described in the sources?

The primary purpose of the Google scraping AI agent is to automate the process of searching Google for specific types of LinkedIn profiles (e.g., CEOs in real estate in Chicago, Founders in tech in San Francisco) and then extracting and saving those LinkedIn profile URLs into a Google Sheet. This allows users to quickly build a database of targeted professional contacts.

What are the essential tools and platforms required to build this AI agent?

To build this AI agent, you will need:

  • n8n: This is the low-code automation platform where the workflows for the agent and the scraping tool are built.
  • OpenAI API: This is used to integrate AI models (like GPT-4o) for parsing user queries and assisting with code generation for data extraction. You'll need an OpenAI account and API key.
  • Google Sheets: This serves as the database where the scraped LinkedIn profile URLs are stored. A pre-setup Google Sheet is required.

How is the AI agent structured, and what are its two main components?

The AI agent is structured into two main workflows:

  • The Tool (Google Scraping Workflow): This workflow is responsible for the actual scraping process. It receives parameters (job title, company, industry, location) from the agent, uses OpenAI to format the search query, performs an HTTP request to Google, parses the raw HTML search results to extract LinkedIn URLs, and then appends these URLs to a Google Sheet.
  • The Agent Workflow: This is the interactive component that users chat with. It utilizes a "tools agent" in n8n, an OpenAI chat model for its "brain," and a "window buffer memory" to maintain conversation context. It calls the Google scraping tool workflow when a user requests specific LinkedIn profiles.

How does the Google scraping tool workflow interact with Google search and extract information?

The Google scraping tool workflow interacts with Google search by performing an HTTP GET request to a Google search URL. It constructs the search query using parameters like site:linkedin.com/in/ to limit results to LinkedIn, and then incorporates the user-defined job title, industry, and location. To avoid being blocked by Google, a User-Agent header is included in the request. After receiving a large chunk of HTML data from Google, a "code node" (whose parsing code was generated with the help of an n8n assistant GPT) parses this HTML to extract only the desired LinkedIn profile URLs.

What are the key steps involved in setting up the "tool" workflow that scrapes Google?

The key steps in setting up the Google scraping "tool" workflow are:

  • "When called by another workflow" trigger: Initiates the workflow when the AI agent calls it, passing search parameters.
  • OpenAI node (Message a Model): Parses the input query into structured parameters (job title, company, industry, location) using a system prompt.
  • HTTP Request node: Sends a GET request to Google, constructing the search URL with LinkedIn-specific parameters and the parsed search criteria. It also includes a User-Agent header.
  • Code node: Processes the returned HTML data from Google, parsing it to extract only the LinkedIn profile URLs.
  • Google Sheets node (Append Row): Takes the extracted LinkedIn URLs and adds them as new rows to a designated Google Sheet.
  • Set node: Marks the workflow as "done" by setting a "response" field, informing the calling AI agent that the task is complete.

How does the AI agent handle user interaction and leverage the scraping tool?

The AI agent handles user interaction via a chat interface. When a user sends a message, the "chat message" trigger in the agent workflow activates. The "AI Agent" node interprets the user's request. If the request requires scraping LinkedIn profiles, the agent's "brain" (OpenAI chat model) identifies that it needs to use the "grab profiles" tool (the Google scraping workflow). The agent passes the relevant parameters from the user's query to the tool. Once the tool completes its operation and signals "done" via the "response" field, the agent formulates a confirmation message back to the user, indicating that the profiles have been obtained.

What is the role of memory and a "chat model" in the AI agent workflow?

In the AI agent workflow:

  • Chat Model (OpenAI chat model, e.g., GPT-4o): This serves as the agent's "brain." It processes user inputs, understands the intent, and decides which tools to use. It's responsible for the agent's reasoning and formulating responses.
  • Memory (Window Buffer Memory): This component gives the agent context for ongoing conversations. It allows the agent to remember previous questions and responses, enabling more fluid and coherent interactions where users can refer back to earlier parts of the conversation without the agent losing track.

Are there any limitations or potential issues with this Google scraping approach, especially regarding the number of results?

Yes, there are limitations. The current approach, which uses a direct HTTP request to Google, appears to be limited to returning around 10 results. Attempts to retrieve more results by adding a "start=X" parameter (for page 2 and beyond) often lead to Google presenting a CAPTCHA or otherwise blocking the request, making it difficult to scrape past the first page. The source suggests that for more comprehensive scraping (i.e., beyond the first 10 results), an actual Google search API like SerpAPI might be necessary to avoid these bot detection mechanisms.
