
How to Scrape Google Flights

Satyam Tripathi · 7 min read


Google Flights aggregates information from airlines and travel companies to show you the available flights, their prices, and their schedules. This helps travellers compare fares, check flight durations, track environmental impact, and ultimately find the best deals.

In this tutorial, I’ll show you how you can easily scrape all the data you need from Google Flights using Python and Playwright.

What Data to Extract from Google Flights?

Google Flights has a lot of valuable information. To get started, apply filters for your departure and destination. For example, I’m looking for a one-way flight from Boston to Honolulu. Once you set your filters, you’ll see a range of flight options.

google flights first page

We’re going to extract several key details, including:

  • Flight name
  • Departure time
  • Arrival time
  • Flight duration
  • Prices
  • Number of stops
  • CO2 emissions (to help you choose more eco-friendly travel options)
  • CO2 emissions variation (e.g., “-17% emissions”)

Let’s dive in and start scraping the data!

Prerequisites

We’ll be using Python with the asynchronous version of Playwright for this tutorial. While Playwright offers both sync and async options, the async version is faster and more efficient for our scraping tasks.

If you’re new to Playwright, I recommend checking out my Playwright 4-part series. It covers everything from the basics to more advanced techniques.

Here’s what you need to get started:

  1. Python Installation: Make sure you have the latest version of Python installed.
  2. Code Editor: Choose a code editor that you’re comfortable with. Popular choices include PyCharm, Visual Studio Code, and Jupyter Notebook.
  3. Basic CSS or XPath Knowledge: Some familiarity with CSS selectors or XPath will help, but don’t worry if you’re new to these concepts; this tutorial is designed to be straightforward and easy to follow.

Installing Libraries

To get started, we need to install Playwright and the required browser binaries. Simply run the following commands in your terminal:

pip install playwright
playwright install
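
If you only need Chromium (the browser used in this tutorial), you can install just that binary instead of all three bundled browsers:

playwright install chromium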

Once you’ve completed these steps, you’re all set to start scraping flight data from Google Flights. Let’s get coding!
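
Before diving into selectors, here is a minimal sketch of the async Playwright structure the rest of the tutorial builds on. It launches a headless Chromium browser, opens the Google Flights homepage (used here purely for illustration), prints the page title to confirm navigation worked, and closes the browser:

import asyncio
from playwright.async_api import async_playwright

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)  # launch a headless Chromium browser
        page = await browser.new_page()                   # open a new tab
        await page.goto("https://www.google.com/travel/flights")  # navigate to Google Flights
        print(await page.title())                         # confirm the page loaded
        await browser.close()

asyncio.run(main())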

Scrape Airline Names from Google Flights

To scrape the airline name, you can target the div with the class .sSHqwe inside each flight result (in the complete script below, each result is an element handle named flight). This element contains the airline’s name, and you can access it using the query_selector method.

airline name

Here’s the code snippet:

airline = await flight.query_selector('.sSHqwe')

The query_selector method finds the first element that matches the provided selector. If no element is found, it returns None.
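
Because query_selector can return None, it’s safest to check the handle before reading its text with inner_text. Here’s a small sketch, where flight is one of the flight result element handles collected in the complete script later on:

airline = await flight.query_selector('.sSHqwe')
airline_text = await airline.inner_text() if airline else "N/A"  # fall back to "N/A" if the element is missing
print(airline_text)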

Scrape Flight Duration

Next, to scrape the flight duration (e.g., "13 hr 27 min"), use the div tag with the class .gvkrdb.

flight duration

Here’s the code to extract the duration:

duration = await flight.query_selector('div.gvkrdb')

Scrape Flight Prices

To extract the price of the flight, target the div tag with the class .FpEdX and then grab the span tag containing the price.

flight prices

Here’s the snippet:

price = await flight.query_selector('div.FpEdX span')
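
Note that the extracted value is a localized string such as "₹20,762". If you want to compare or sort prices numerically, you can strip the currency symbol and separators; the helper below is just an illustration and not part of the original script:

import re

def parse_price(price_text):
    """Convert a localized price string like '₹20,762' to an integer, or None if no digits are present."""
    digits = re.sub(r"[^\d]", "", price_text)  # drop currency symbols, commas, and spaces
    return int(digits) if digits else None

print(parse_price("₹20,762"))  # 20762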

Scrape Departure and Arrival Times

To scrape both the departure and arrival times, you can use selectors targeting span elements with aria-label attributes. For example, to extract the departure time, use span[aria-label*="Departure time"], and for the arrival time, span[aria-label*="Arrival time"].

departure arrival time

Here’s how you can do it:

departure_time = await flight.query_selector('span[aria-label*="Departure time"]')
arrival_time = await flight.query_selector('span[aria-label*="Arrival time"]')
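
These selectors match on the aria-label attribute, so you can also read the attribute itself with get_attribute if you prefer the full accessible description over the visible text. A quick sketch, again assuming flight is a single result element handle:

departure_time = await flight.query_selector('span[aria-label*="Departure time"]')
if departure_time:
    label = await departure_time.get_attribute("aria-label")  # the full accessible description
    print(label)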

Scrape CO2 Emissions and Emission Changes

To extract data on CO2 emissions, use the div with class .O7CXue for CO2 emissions and .N6PNV for emission changes.

co2 emissions

Here’s the code for CO2 emissions:

co2_emissions = await flight.query_selector('div.O7CXue')

And for the emissions variation:

co2 emissions variations

Here’s how you can do it:

emissions_variation = await flight.query_selector('div.N6PNV')

Scrape Flight Stops

Finally, to extract information about flight stops, use the div with class .hF6lYb and the span inside it with class .rGRiKd. This will give you details about the number of stops and where the flight stops.

flight stops

Here’s the code:

stop_info = await flight.query_selector('div.hF6lYb span.rGRiKd')
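
The resulting text is a human-readable string such as "1 stop in LAX". If you’d rather work with the number of stops as an integer, a small parser like the one below can help; it assumes wording like the examples shown in this tutorial and treats text without a stop count as nonstop, so it’s only an illustration:

import re

def count_stops(stop_text):
    """Return the number of stops from text like '1 stop in LAX'; treat missing counts as 0 (nonstop)."""
    match = re.search(r"(\d+)\s+stop", stop_text)
    return int(match.group(1)) if match else 0

print(count_stops("1 stop in LAX"))  # 1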

Complete Code

Here’s the complete code, combining all the snippets:

import asyncio
import json
from playwright.async_api import async_playwright


async def scrape_flight_data():
    flight_data = []  # List to store flight details

    # Start Playwright and open a browser
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)  # Launch browser
        page = await browser.new_page()  # Open a new browser page

        # Go to the flight search results page
        await page.goto("https://www.google.com/travel/flights/search?tfs=CBwQAhoqEgoyMDI0LTA9LTI1agwIAxIIL20vMDFjeF9yDggDEgovbS8wMmhyaDBfQAFIAXABggELCP___________wGYAQI&tfu=EgYIABAAGAA&hl=en&gl=IN")

        # Wait for relevant flight data to load
        await page.wait_for_selector(".pIav2d")

        # Extract flight data
        flights = await page.query_selector_all(".pIav2d")
        for flight in flights:
            # Extract departure time
            departure_time = await flight.query_selector('span[aria-label*="Departure time"]')
            departure_time_text = await departure_time.inner_text() if departure_time else "N/A"

            # Extract arrival time
            arrival_time = await flight.query_selector('span[aria-label*="Arrival time"]')
            arrival_time_text = await arrival_time.inner_text() if arrival_time else "N/A"

            # Extract airline
            airline = await flight.query_selector(".sSHqwe")
            airline_text = await airline.inner_text() if airline else "N/A"

            # Extract flight duration
            duration = await flight.query_selector("div.gvkrdb")
            duration_text = await duration.inner_text() if duration else "N/A"

            # Extract stop information
            stop_info = await flight.query_selector("div.hF6lYb span.rGRiKd")
            stop_info_text = await stop_info.inner_text() if stop_info else "N/A"

            # Extract price
            price = await flight.query_selector("div.FpEdX span")
            price_text = await price.inner_text() if price else "N/A"

            # Extract CO2 emissions
            co2_emissions = await flight.query_selector("div.O7CXue")
            co2_emissions_text = await co2_emissions.inner_text() if co2_emissions else "N/A"

            # Extract CO2 emissions variation
            emissions_variation = await flight.query_selector("div.N6PNV")
            emissions_variation_text = await emissions_variation.inner_text() if emissions_variation else "N/A"

            # Add the flight details to the flight_data list
            flight_data.append({
                "departure_time": departure_time_text,
                "arrival_time": arrival_time_text,
                "airline": airline_text,
                "duration": duration_text,
                "stops": stop_info_text,
                "price": price_text,
                "co2_emissions": co2_emissions_text,
                "emissions_variation": emissions_variation_text,
            })

        # Save the extracted data to a JSON file
        with open("flight_data.json", "w", encoding="utf-8") as json_file:
            json.dump(flight_data, json_file, ensure_ascii=False, indent=4)

        # Close the browser
        await browser.close()


# Run the async function
asyncio.run(scrape_flight_data())

The final result is:

[
    {
        "departure_time": "8:44 AM",
        "arrival_time": "4:11 PM",
        "airline": "United",
        "duration": "13 hr 27 min",
        "stops": "1 stop in LAX",
        "price": "₹20,762",
        "co2_emissions": "541 kg CO2e",
        "emissions_variation": "-17% emissions"
    },
    {
        "departure_time": "10:00 AM",
        "arrival_time": "9:45 PM",
        "airline": "Alaska",
        "duration": "17 hr 45 min",
        "stops": "1 stop in SEA",
        "price": "₹20,762",
        "co2_emissions": "525 kg CO2e",
        "emissions_variation": "-20% emissions"
    },
    {
        "departure_time": "6:10 AM",
        "arrival_time": "2:00 PM",
        "airline": "Delta",
        "duration": "13 hr 50 min",
        "stops": "1 stop in SLC",
        "price": "₹21,597",
        "co2_emissions": "520 kg CO2e",
        "emissions_variation": "-20% emissions"
    },
    ...
]

Using a Web Scraping API

Google Flights renders its results dynamically and protects them with anti-scraping measures such as IP blocking and CAPTCHA challenges. Traditional workarounds like proxies or rotating IP addresses can improve success rates, but they are not foolproof, as Google may continue to roll out new countermeasures.
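
For example, Playwright itself lets you route traffic through a proxy by passing a proxy option to the launch call. The server address and credentials below are placeholders, and this is only a sketch of where such a configuration would plug into the script above:

import asyncio
from playwright.async_api import async_playwright

async def launch_with_proxy():
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            proxy={
                "server": "http://proxy.example.com:8080",  # placeholder proxy endpoint
                "username": "USERNAME",                      # placeholder credentials
                "password": "PASSWORD",
            },
        )
        # ...reuse the scraping logic from the complete script here...
        await browser.close()

asyncio.run(launch_with_proxy())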

To reliably scrape any website, regardless of its anti-bot complexity, using a web scraping API like ScrapingAnt is highly effective. It automatically handles Chrome page rendering, low-latency rotating proxies, and CAPTCHA avoidance, so you can focus on your scraping logic without worrying about getting blocked.

To start using the ScrapingAnt API, you only need two things: the URL you’d like to scrape and the API key, which can be obtained from your ScrapingAnt dashboard after signing up for a free test account.

scrapingant dashboard

To integrate the ScrapingAnt API into your Python project, install the Python client scrapingant-client:

pip install scrapingant-client

You can also explore more on the GitHub project page.

The ScrapingAnt API client is straightforward to use, supporting various input and output formats as described on the Request and Response Format page. Below is a simple example demonstrating its usage:

from scrapingant_client import ScrapingAntClient

client = ScrapingAntClient(token="YOUR_SCRAPINGANT_API_KEY")

response = client.general_request(
"https://www.amazon.com/Dowinx-Headrest-Ergonomic-Computer-Footrest/dp/B0CVWXK632/"
)
print(response.content)

Here's our result:

scrapingant result

This shows how ScrapingAnt simplifies the web scraping process by handling the complexities for you.
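
The same approach applies to Google Flights: pass your search URL to general_request and parse the returned HTML with the selectors identified earlier. The sketch below assumes you also have beautifulsoup4 installed and that the class names haven’t changed; the URL is a placeholder for your own search:

from bs4 import BeautifulSoup
from scrapingant_client import ScrapingAntClient

client = ScrapingAntClient(token="YOUR_SCRAPINGANT_API_KEY")

# Fetch the rendered Google Flights results page through ScrapingAnt
response = client.general_request(
    "https://www.google.com/travel/flights/search?tfs=..."  # placeholder: your Google Flights search URL
)

# Parse the HTML with the same selectors used in the Playwright script
soup = BeautifulSoup(response.content, "html.parser")
for flight in soup.select(".pIav2d"):
    airline = flight.select_one(".sSHqwe")
    price = flight.select_one("div.FpEdX span")
    print(
        airline.get_text(strip=True) if airline else "N/A",
        price.get_text(strip=True) if price else "N/A",
    )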

Conclusion

In this tutorial, you learned how to scrape flight data from Google Flights using Python and Playwright. We covered how to extract key details like flight names, departure and arrival times, flight durations, prices, number of stops, CO2 emissions, and emission changes.

By following the steps outlined in this tutorial, you can easily scrape flight data from Google Flights and use it to find the best deals for your next trip. You can also use a web scraping API like ScrapingAnt to handle dynamic content and anti-scraping measures, ensuring reliable and efficient data extraction.
