AI tools are getting really good at helping developers, but they don't always work smoothly with the tools we actually use every day. That's where MCP (Model Context Protocol) comes in. It's a simple way to connect AI models to the real tools, scripts, and services that developers need, making it possible to automate actual work right inside your development setup.
Why we built this
Let's look at Cursor as an example. It's a super popular AI-powered code editor. If you're a developer building data pipelines, you probably already use PyAirbyte or similar libraries to build these pipelines in code. Usually, you'd have to look up documentation to figure out what settings each source and destination needs, read through framework docs for handling streams and errors, and then finally start writing code.
With AI and the new PyAirbyte MCP server, all that research goes away. If you're building pipelines in Python, you just need a simple request and you're done.
About the PyAirbyte MCP server
The PyAirbyte MCP server is developed by experienced AI developers who are constantly experimenting with new tools and frameworks. Since AI is moving so fast, this project aims to share learnings and outcomes with the developer community. While this is not an officially supported project, the community is encouraged to share feedback and collaborate with other developers building with AI.
Get started with the PyAirbyte MCP Server
Setting up the PyAirbyte MCP server is straightforward and takes just a few minutes. Follow these simple steps to connect it with Cursor and start generating data pipelines instantly.
Step 1: Add the server to Cursor
The PyAirbyte MCP server runs remotely on Heroku. To add it to Cursor, go to Settings > Cursor Settings > Tools & Integrations, and add a new MCP Server.
Note: If you're not on the latest Cursor version, look for Settings > Cursor Settings > MCP Tools instead.
Step 2: Configure the server
Copy and paste this mcp.json config, but add your own OpenAI API key:
{
  "mcpServers": {
    "pyairbyte-mcp": {
      "url": "https://pyairbyte-mcp-7b7b8566f2ce.herokuapp.com/mcp",
      "env": {
        "OPENAI_API_KEY": "your-openai-api-key"
      }
    }
  }
}
Step 3: Verify the connection
Save the config, and you should see a new server called pyairbyte-mcp with one tool enabled: generate_pyairbyte_pipeline. You'll see a green dot and "1 tools enabled" if everything's working.
Once the server is connected properly, it knows to listen for chat prompts like "generate pipeline" or "create pipeline".
Step 4: Create your first pipeline
Let's try it out. Start a new chat and type:
"create a data pipeline from source-postgres to destination-snowflake"
Notice the "source-" and "destination-" prefixes. These tell the system exactly what kind of connector you want from the Airbyte registry. Using these prefixes makes sure the AI doesn't get confused and give you a destination config when you wanted a source, or vice versa.
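As a toy illustration (this is not the server's actual code), the prefix convention makes the intended role of each connector mechanically checkable instead of something the AI has to guess:

```python
def connector_role(name: str) -> str:
    """Classify an Airbyte registry connector name by its prefix."""
    if name.startswith("source-"):
        return "source"
    if name.startswith("destination-"):
        return "destination"
    raise ValueError(f"Ambiguous connector name: {name!r}")

role_a = connector_role("source-postgres")        # "source"
role_b = connector_role("destination-snowflake")  # "destination"
```
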
Step 5: Give permission and generate
Cursor will interpret your request and call the MCP server. If it's your first time, you'll be asked for permission to execute the tool. Grant it, and the MCP server will generate the code and setup instructions for you.
Note about .env files: By default, Cursor can't create .env files unless you've specifically allowed this in Settings by modifying the global gitignore. Even if you don't allow direct .env writing, the MCP will give you all the instructions to create your own and use the provided configuration.
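For illustration, the kind of .env file those instructions describe might look like the following. The variable names here are hypothetical; the actual names depend on which connectors your pipeline uses:

```shell
# .env — keep this file out of version control
OPENAI_API_KEY=your-openai-api-key
POSTGRES_PASSWORD=your-postgres-password      # hypothetical connector secret
SNOWFLAKE_PASSWORD=your-snowflake-password    # hypothetical connector secret
```
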
Try advanced examples
And that's it! With one simple prompt, you just created a complete data pipeline. Why not try another prompt that combines a pipeline request with something common like visualizing data with popular frameworks like Hex, Evidence, or Streamlit?
Example: Pipeline with visualization
"create a pipeline from source-faker to dataframe and visualize the data in a streamlit bar chart"
The MCP server will generate a pyairbyte_pipeline.py file that loads the data into a dataframe and makes it available for use in your codebase.
And since you're using the PyAirbyte MCP server within Cursor's chat, the AI will also create the relevant Streamlit code for you, with full context on how to access the dataframe.
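For reference, generated pipelines tend to follow the standard PyAirbyte pattern. Here's a hand-written sketch of that shape, not the server's exact output; the users stream, the age column, and the config values are assumptions based on source-faker's defaults:

```python
import airbyte as ab
import streamlit as st

# Pull sample data with the faker source (installs the connector on first run).
source = ab.get_source(
    "source-faker",
    config={"count": 500},  # assumed config: number of fake records
    install_if_missing=True,
)
source.check()              # verify the connector and its config
source.select_all_streams()
result = source.read()      # read records into the local cache

df = result["users"].to_pandas()  # "users" is one of source-faker's streams

st.title("Fake users by age")
st.bar_chart(df["age"].value_counts())  # "age" column assumed from faker schema
```
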
Under the hood
As you can see from the quick setup guide above, the PyAirbyte MCP server is really easy to use and quick to set up. But there are some interesting things happening behind the scenes that are worth exploring.
PyAirbyte MCP
The actual server implementation is a Python app deployed on Heroku. The app uses the FastMCP framework and the @mcp.tool() decorator to expose tools for clients to call. The server currently provides just one tool, generate_pyairbyte_pipeline, but there are plans for many more. As new tools are added to the server, all you need to do is disable and re-enable the server in your Cursor config to pick them up.
Looking at the code for the generate_pyairbyte_pipeline tool, you'll notice it returns a dictionary. If you're building local MCP servers, you can return whatever you want, but for remote servers the return value must be a dictionary.
from typing import Any, Dict

@mcp.tool()
async def generate_pyairbyte_pipeline(
    source_name: str,       # e.g. "source-postgres"
    destination_name: str,  # e.g. "destination-snowflake"
    ctx: Context,           # MCP Context object
) -> Dict[str, Any]:
    ...
Inside the tool function, the server uses OpenAI's chat completions API along with vector embeddings of relevant documentation that has been loaded into OpenAI's file storage. This approach lets the server figure out which libraries and coding patterns to use with the PyAirbyte framework, including what configuration settings each Airbyte connector requires. The output from the OpenAI request is then used to fill in code templates and generate instructions for the developer. The server currently uses gpt-4o as the LLM, but will likely switch to OpenRouter soon so it can pick the best model for generating code.
response = await openai_client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": query}
    ],
)
OpenAI file storage
A few months ago, OpenAI launched file storage with full embedding support as part of several tool updates to the Responses API. The PyAirbyte MCP server uses this vector store to give OpenAI context on how to write pipelines, best practices for using the PyAirbyte library, and which configuration parameters are needed for each of the 600+ connectors in the Airbyte Connector Registry. With this context, the MCP server knows everything it needs to generate pipelines without relying on general knowledge the LLM picked up from the internet. In our experience, providing this kind of data context is critical for getting accurate results from any AI agent or MCP service.
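To make the grounding step concrete, here's a sketch of how a file_search-augmented request to the Responses API could be assembled as a plain payload. The vector store ID and question are hypothetical placeholders, and this helper is illustrative rather than the server's actual code:

```python
def build_docs_query(vector_store_id: str, question: str) -> dict:
    """Assemble a Responses API payload that grounds the model in
    uploaded connector docs via the file_search tool."""
    return {
        "model": "gpt-4o",
        "input": question,
        "tools": [
            # file_search retrieves relevant chunks from the vector store
            # before the model answers, instead of relying on general knowledge.
            {"type": "file_search", "vector_store_ids": [vector_store_id]},
        ],
    }

payload = build_docs_query("vs_abc123", "What config does source-postgres need?")
```
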
Remote MCP
Each client supports the MCP schema a bit differently. Cline, for example, throws a schema error when you add the mcp.json above because it doesn't support the env attribute. Since the PyAirbyte MCP server requires users to pass in their OpenAI API key, and that depends on env support in mcp.json, the remote PyAirbyte server is currently designed to work only with Cursor. Claude Desktop doesn't currently support remote servers at all; Claude Code does, but this hasn't been fully verified.
Summary
The PyAirbyte MCP server is available now for you to add to Cursor. With a short prompt, you can let AI generate a complete data pipeline using any of the hundreds of connectors offered by Airbyte. More tools are in development that pair popular AI tools and frameworks with Airbyte, aiming for a great developer experience when building any app that needs data and context for its AI-powered workflows.