By Misha Zanka

1/26/2025

Search startup jobs with Python and LLMs

Company websites contain many job listings that never make it to popular job boards. Finding remote startup jobs, for example, can be challenging because these companies often aren't listed on job boards at all. To find these jobs you need to:

  • Find promising companies
  • Search for their career pages
  • Analyze available job listings
  • Manually record job details

This takes a lot of time, so we are going to automate it.

Preparation

We'll use the Parsera library to automate job scraping. Parsera provides two usage options:

  • Local: Pages are processed on your machine using an LLM of your choice;
  • API: All processing occurs on Parsera's servers.

In this example we'll go with the Local option, since this is a one-time, small-scale extraction.

To get started, install the required packages:

pip install parsera
playwright install

Since we're running the local setup, an LLM connection is required. For simplicity we'll use OpenAI's gpt-4o-mini, which is Parsera's default model and only requires setting an environment variable:

import os
from parsera import Parsera

os.environ["OPENAI_API_KEY"] = "<YOUR_OPENAI_API_KEY_HERE>"

# With no model argument, Parsera falls back to its default LLM
scraper = Parsera()
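
If you'd prefer a different model, Parsera also accepts a LangChain chat model through its model argument. A minimal sketch, assuming the langchain_openai package is installed:

from langchain_openai import ChatOpenAI

# Pass any LangChain chat model explicitly instead of relying on the default;
# ChatOpenAI reads OPENAI_API_KEY from the environment
llm = ChatOpenAI(model="gpt-4o-mini")
scraper = Parsera(model=llm)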

With everything set up, we're ready to start scraping.

Step 1: Getting a list of fresh Series A startups

First, we need to find a list of companies of interest and their websites. I found a list of 100 Series A startups that closed their rounds last month. Growing companies with fresh funding seem like a good place to look.

Let's grab the country and website of these companies:

url = "https://growthlist.co/series-a-startups/"
elements = {
    "Website": "Website of the company",
    "Country": "Country of the company",
}
all_startups = await scraper.arun(url=url, elements=elements)
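
Note that arun is a coroutine, so the top-level await above works as-is in a notebook. In a standalone script you'd wrap the call with asyncio, for example:

import asyncio

async def main():
    # Run the async extraction inside an event loop
    return await scraper.arun(url=url, elements=elements)

all_startups = asyncio.run(main())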

With the country field in hand, we can filter for the country we're interested in. Let's narrow the search down to the United States:

us_websites = [
    item["Website"] for item in all_startups if item["Country"] == "United States"
]

Step 2: Finding Careers pages

Now we have a list of websites of new Series A startups from the US. The next step is to find their careers pages, which we'll do by extracting careers page links from each company's main page:

from urllib.parse import urljoin
 
# Define our target
careers_target = {"url": "Careers page url"}
 
careers_pages = []
for website in us_websites:
    website = "https://" + website
    result = await scraper.arun(url=website, elements=careers_target)
    if len(result) > 0:
        url = result[0]["url"]
        # Resolve relative links against the company's main page
        if url.startswith("/") or url.startswith("./"):
            url = urljoin(website, url)
        careers_pages.append(url)

Note that this step could be replaced by calling a search API instead, trading LLM calls for search calls. One possible sketch of that alternative is shown below.
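
For example, using the duckduckgo_search package (an assumption on my side; any search API would work similarly):

from duckduckgo_search import DDGS

careers_pages = []
for website in us_websites:
    # Ask the search engine for the company's careers page directly
    results = DDGS().text(f"site:{website} careers", max_results=1)
    if results:
        careers_pages.append(results[0]["href"])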

Step 3: Scraping open jobs

The last step is to load all open jobs from the careers pages we collected. Let's say we are looking for a software engineering job; then we'll extract the job title, location, link, and whether the role is related to software engineering:

jobs_target = {
    "Title": "Title of the job",
    "Location": "Location of the job",
    "Link": "Link to the job post",
    "SE": "True if this is a software engineering job, otherwise False",
}
 
jobs = []
for page in careers_pages:
    result = await scraper.arun(url=page, elements=jobs_target)
    if len(result) > 0:
        for row in result:
            # Keep the source page and make the job link absolute
            row["url"] = page
            row["Link"] = urljoin(row["url"], row["Link"])
    jobs.extend(result)
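
Scraping dozens of sites in a row rarely succeeds everywhere, so in practice you may want to guard each call. A minimal sketch with a broad catch, since any network or parsing error can surface here:

jobs = []
for page in careers_pages:
    try:
        result = await scraper.arun(url=page, elements=jobs_target)
    except Exception as exc:
        # Skip sites that time out or fail to parse instead of aborting the run
        print(f"Skipping {page}: {exc}")
        continue
    for row in result:
        row["url"] = page
        row["Link"] = urljoin(row["url"], row["Link"])
    jobs.extend(result)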

With all jobs extracted, we can filter out everything that isn't software engineering and save the result to a .csv file:

import csv

engineering_jobs = [job for job in jobs if job["SE"] == "True"]

# newline="" avoids blank rows on Windows when writing with the csv module
with open("jobs.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(engineering_jobs[0].keys())
    for job in engineering_jobs:
        writer.writerow(job.values())
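
A slightly more robust alternative is csv.DictWriter, which writes columns by key and so stays correct even if the dictionaries' key order ever differs:

with open("jobs.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=engineering_jobs[0].keys())
    writer.writeheader()
    writer.writerows(engineering_jobs)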
 

In the end, we have a table of jobs that looks like this:

Title                | Location  | Link                                                                         | SE   | url
AI Tech Lead Manager | Bengaluru | https://job-boards.greenhouse.io/enterpret/jobs/6286095003                  | True | https://boards.greenhouse.io/enterpret/
Backend Developer    | Tel Aviv  | https://www.upwind.io/careers/co/tel-aviv/BA.04A/backend-developer/all#jobs | True | https://www.upwind.io/careers
...                  | ...       | ...                                                                          | ...  | ...

Conclusion

As a next step, we could repeat the same process to extract more info from the full job listings, such as the tech stack, or to filter for remote startup jobs. This saves the time spent manually reviewing every page. You can try it yourself by iterating over the Link fields and extracting the elements you're interested in, as in the sketch below.
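
A minimal sketch of that follow-up; the element descriptions here are assumptions you can adapt:

details_target = {
    "Stack": "Technologies mentioned in the job description",
    "Remote": "True if the job is remote, otherwise False",
}

# Visit each job posting and merge the extra details into the job record
for job in engineering_jobs:
    details = await scraper.arun(url=job["Link"], elements=details_target)
    if details:
        job.update(details[0])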

I hope you found this article helpful; if you have any questions, let me know.