Web Scraper
2023-05-24
A Python program I created that scrapes emails from websites.
I created this web scraper on behalf of a client. They had a list of UK schools and they wanted to find as many PTA-associated emails as they could from the school websites. They were doing it by hand when they approached me, asking if I could help them out.
The scraper takes a school URL and looks for the website’s sitemap, which gives it a list of all the pages on the school’s site. It then goes page by page, using the Python module BeautifulSoup to pick out the <a> tags in the HTML. A regular expression (RegEx) checks whether each link is an email address or a web page, and any email address is added to a list of emails to be checked. Once it has collected as many emails as it can from the website, it compares them against a second set of regular expressions, ordered from most to least likely to indicate a PTA email. If a match was found, the email was saved with a likelihood score; if not, the fact that no PTA email was found for that school was recorded.
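The link-extraction and scoring steps described above could look roughly like the sketch below. It is not the original code: the patterns in PTA_PATTERNS, the scoring scheme, and the function names are hypothetical, and the standard library's HTMLParser stands in for BeautifulSoup so the example runs without extra installs (with BeautifulSoup, the collection step would be a call such as soup.find_all("a", href=True)).

```python
import re
from html.parser import HTMLParser

# Hypothetical patterns, ordered from most to least likely to be a PTA address.
PTA_PATTERNS = [
    re.compile(r"^pta@", re.IGNORECASE),
    re.compile(r"pta|ptfa|friends", re.IGNORECASE),
    re.compile(r"parents?", re.IGNORECASE),
]

# Matches a mailto: link and captures the address, dropping any ?subject=... part.
MAILTO_RE = re.compile(r"^mailto:([^?]+)", re.IGNORECASE)


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_emails(html):
    """Return the unique addresses found in mailto: links, in page order."""
    collector = LinkCollector()
    collector.feed(html)
    emails = []
    for href in collector.hrefs:
        match = MAILTO_RE.match(href.strip())
        if match:
            email = match.group(1).strip().lower()
            if email not in emails:
                emails.append(email)
    return emails


def score_email(email):
    """Return a likelihood score (higher means more likely PTA), or 0 for no match."""
    for rank, pattern in enumerate(PTA_PATTERNS):
        if pattern.search(email):
            return len(PTA_PATTERNS) - rank
    return 0


page = """
<a href="/contact">Contact</a>
<a href="mailto:office@example-school.sch.uk">Office</a>
<a href="mailto:PTA@example-school.sch.uk?subject=Hello">PTA</a>
"""
scored = {email: score_email(email) for email in extract_emails(page)}
```

Because the patterns are tried in order, the first one that matches decides the score, which mirrors the idea of checking the most PTA-like forms first. A score of 0 marks an address that matched nothing and would be left for manual review.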
Several techniques were used to speed up the search. Firstly, when reading the sitemap, certain links, such as images, diary pages, and news pages, were filtered out, as there can often be hundreds of these and they slow down the search. I also used BeautifulSoup to parse the HTML, as this was quicker than using a RegEx. Finally, I narrowed the search by only looking at schools that had students in the client’s target age group.
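As an illustration of the sitemap filtering, here is a minimal sketch that assumes the sitemap follows the standard sitemaps.org XML format. The keyword and extension lists, the function name, and the example URLs are all hypothetical; the real lists would need tuning per site, since a keyword like "news" would also drop a "newsletter" page.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemaps.org sitemap files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical filters: URL fragments and file extensions not worth visiting.
EXCLUDED_KEYWORDS = ("news", "diary")
EXCLUDED_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif")


def pages_to_scan(sitemap_xml):
    """Return the sitemap URLs that survive the keyword and extension filters."""
    root = ET.fromstring(sitemap_xml)
    kept = []
    for loc in root.findall(".//sm:loc", SITEMAP_NS):
        if not loc.text:
            continue
        url = loc.text.strip()
        lowered = url.lower()
        if lowered.endswith(EXCLUDED_EXTENSIONS):
            continue  # skip image files
        if any(keyword in lowered for keyword in EXCLUDED_KEYWORDS):
            continue  # skip diary and news pages
        kept.append(url)
    return kept


SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example-school.sch.uk/contact</loc></url>
  <url><loc>https://example-school.sch.uk/news/sports-day</loc></url>
  <url><loc>https://example-school.sch.uk/diary/2023-05</loc></url>
  <url><loc>https://example-school.sch.uk/uploads/fete.JPG</loc></url>
  <url><loc>https://example-school.sch.uk/parents/pta</loc></url>
</urlset>"""

remaining = pages_to_scan(SAMPLE_SITEMAP)
```

In this example only the contact and PTA pages remain, so the scraper would fetch two pages instead of five.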
This program is by no means perfect. If an email is not stored as a link, it will be missed. If the email is only shown in an image, such as an embedded poster, it will also be missed. Because there are no rules governing web design, and no consistent pattern in PTA email addresses, there would always be edge cases the program missed. To that end, I delivered the client a list of the emails I had found, along with a list of school sites where the program couldn’t be sure about an email, so they could go back and double-check.