How phishing threats can be neutralised with web scraping

Image: Adobe Stock / Michael Traitov

Identifying fraudulent emails may be easy for tech-minded professionals – but not so obvious for the general public. At the same time, tactics used by cybercriminals to trick users into giving up sensitive financial and personal information are growing in number and sophistication.

Where education and awareness are not enough, web scraping can help identify the fraudulent websites that capture sensitive user data, stopping phishers in their tracks.

Millions of new users of varying ages and technical abilities are coming online each year all over the world. While many can easily recognise an online scam, fake email address, or fraudulent website, not everyone has the same skills or experience.

Some people might look at a malicious spam message and laugh before deleting it. Others take those emails very seriously and follow links to professionally produced fraudulent websites that prompt them to enter critical financial and personal details.

Known widely as ‘phishing,’ these scams are prevalent and growing in number and severity. According to the US Federal Bureau of Investigation (FBI), fraudulent wire-transfer emails sent to company employees have cost businesses $26 billion since 2016. Besides causing billions in damages, phishing harms the reputation of the company the scammers pretend to represent, reduces trust between businesses, and compromises system cohesion.

Types of phishing threats

Phishing varies by method, but all types generally attempt to extract credentials from users by impersonating a legitimate site. Some common forms to look out for include:

Deceptive phishing: Deceptive phishing is the most common type of email fraud, in which cybercriminals impersonate a legitimate organisation in order to obtain personal data or login credentials. Messages sent by phishers typically contain shortened links or legitimate-looking URLs that redirect to a fraudulent site. All contain replicated brand logos and other company-specific marketing content.

Example: An email that appears to come from a payment processing company that states the user has received a refund, with instructions to visit a website and enter banking details.

Spear phishing: Spear phishing is more direct and specialised than deceptive phishing. In this case, the attacker researches the victim’s personal information in advance on a social network, business directory, or industry website. Using those details, the attacker sends a customised message containing a malicious URL that directs the victim to a website asking for personal data.

Example: A fraudulent email that appears to come from a credit card company, customised with details obtained from a social network, asking the recipient to verify login credentials and contact information.

Smishing: Smishing uses SMS messages instead of emails. Some of these messages contain links to fraudulent sites, while others attempt to force-download a malicious app that deploys ransomware to control the device.

Example: An SMS message that appears to come from a banking institution that requests the user to download an app and register for services using banking login details.

Vishing: Vishing is likely the most interactive form of phishing. In this scenario, an imposter uses a Voice over Internet Protocol (VoIP) server to call their victim with requests for personal information. In some cases, attackers first send an email or SMS that contains instructions to call a phone number.

Example: The victim receives a phone call and is told they have a deposit pending from a foreign country, with a request to visit a site to enter banking details.

Phishing threats are growing

Phishing threats have grown, especially over the last two years. According to a report from 2020, more than 75% of respondents admit to opening emails from unknown senders, and over 50% say phishing websites look more realistic than ever before.

A recent Google report supports this trend, stating that the number of phishing websites rose by 350% in 2020, reaching 522,495 by March. In addition to typical requests for banking details, attackers used Covid-19-related content to frighten users into giving up personal information.

How web scraping identifies phishing websites

Businesses fight phishing by scanning all incoming and outgoing emails with the help of proxy networks. Web scraping is an integral part of this process: it extracts data from the websites linked in those emails so that analysts can determine whether they belong to legitimate businesses.
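As a rough illustration of the first step in that process, the sketch below pulls links out of an HTML email body and lists the hosts a scraper would need to visit. It uses only the Python standard library. The function names and the sample hosts are made up for this example and do not describe any particular vendor's pipeline.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkExtractor(HTMLParser):
    """Collect (href, visible text) pairs from the anchors in an HTML email body."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that sits inside an anchor with an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html_body):
    """Return every (href, text) pair found in the email body."""
    parser = LinkExtractor()
    parser.feed(html_body)
    parser.close()
    return parser.links


def hosts_to_review(html_body):
    """Return the unique hostnames a scraper or analyst would visit next."""
    hosts = []
    for href, _text in extract_links(html_body):
        host = urlparse(href).hostname
        if host and host not in hosts:
            hosts.append(host)
    return hosts
```

Calling `hosts_to_review(email_html)` on a message gives the list of sites to fetch and inspect. Shortened links would still need to be resolved to their final destination, which this sketch does not attempt.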

Proxies provide anonymity, allowing cybersecurity professionals to evade detection. However, since phishers are aware they are being targeted, they routinely block IP addresses suspected of belonging to security companies. To address this issue, residential proxies deployed from varying locations act as intermediaries that provide anonymity and bypass geolocation restrictions. In addition to providing privacy, proxies distribute requests across many addresses, which helps avoid server issues.
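One simple way to distribute requests across a proxy pool is round-robin rotation. The sketch below shows the idea with the standard library's `urllib.request`. The proxy addresses are placeholders and the `ProxyRotator` class is a hypothetical helper written for this example; a real deployment would take its pool from a proxy provider.

```python
import itertools
import urllib.request

# Placeholder proxy endpoints (assumption): replace with a real pool.
PROXY_POOL = [
    "http://proxy-de.example.test:8000",
    "http://proxy-us.example.test:8000",
    "http://proxy-jp.example.test:8000",
]


class ProxyRotator:
    """Hand out proxies from a pool in round-robin order."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("proxy pool must not be empty")
        self._cycle = itertools.cycle(proxies)

    def next_proxy(self):
        return next(self._cycle)

    def build_opener(self):
        """Return (proxy, opener) where the opener routes traffic via that proxy."""
        proxy = self.next_proxy()
        handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
        return proxy, urllib.request.build_opener(handler)
```

Each call to `build_opener()` yields an opener bound to the next proxy in the pool, so consecutive fetches such as `opener.open(url, timeout=10)` leave from different addresses.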

Advanced web crawlers provide increased accuracy

Thousands of new phishing websites appear each year, targeting banking institutions, financial firms, cloud-based storage sites, and government websites. Unfortunately, existing phishing detection tools often fail to identify current threats because they depend on legacy databases of previously identified fraudulent websites.

Recent research has led to advanced web scrapers that rate the heuristics (content factors) of both genuine and illegitimate websites with increased precision. Data collected using these applications is then analysed using a data mining tool to find patterns, report findings, and detect fraudulent websites with greater accuracy.

The phishing website detection framework is based on content-based heuristics generated from training data sets collected from active and previously detected phishing websites. Advanced web crawlers scrape relevant information from these data sets, and a mining tool identifies heuristics typical of fraudulent websites. Following analysis, weights are calculated to produce a ‘phishing factor’ that enables users to determine the probability that a website is illegitimate.
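To make the idea of a weighted ‘phishing factor’ concrete, here is a minimal sketch that checks a page against four content heuristics and combines them into a score between 0 and 1. The heuristics, their names, and the weights are all assumptions chosen for illustration; in the framework described above, the weights would be calculated from training data rather than set by hand. The regular expressions are also a simplification of real HTML parsing.

```python
import re
from urllib.parse import urlparse

# Illustrative weights only (assumption): not taken from any published model.
WEIGHTS = {
    "ip_address_host": 0.30,
    "no_https": 0.15,
    "password_field": 0.20,
    "external_form_action": 0.35,
}


def extract_heuristics(url, html):
    """Return a dict of boolean heuristics for one scraped page."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    form_actions = re.findall(
        r'<form[^>]*\baction=["\']([^"\']+)["\']', html, flags=re.I
    )
    # A relative action has no hostname, so it is treated as same-host.
    external = any((urlparse(a).hostname or host) != host for a in form_actions)
    return {
        "ip_address_host": bool(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host)),
        "no_https": parsed.scheme != "https",
        "password_field": bool(re.search(r'type=["\']?password', html, flags=re.I)),
        "external_form_action": external,
    }


def phishing_factor(heuristics, weights=WEIGHTS):
    """Weighted share of triggered heuristics, from 0.0 (none) to 1.0 (all)."""
    total = sum(weights.values())
    score = sum(w for name, w in weights.items() if heuristics.get(name))
    return score / total
```

A page served from a bare IP address over plain HTTP, with a password field that posts to a different host, triggers every heuristic and scores 1.0. A page that triggers none scores 0.0.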

Finally, machine learning could be the next major step in phishing detection. Scraping produces a large volume of new data every day in addition to the numerous sources already available.

Models can be trained on the labelled data that’s available in public databases. Scraping would then supply the test data needed to see whether these ML models perform well in real-world scenarios.
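The train-then-test workflow can be sketched with a small logistic regression written in plain Python. Everything here is a toy: the labelled rows are synthetic stand-ins for a public data set, the four binary features are assumed for the example, and the ‘scraped’ pages are hypothetical. A real project would more likely use an established ML library and far more data.

```python
import math

# Synthetic stand-in for a public labelled data set (assumption).
# Columns: ip_address_host, no_https, password_field, external_form_action.
# Label 1 = phishing, 0 = legitimate.
LABELLED = [
    ([1, 1, 1, 1], 1),
    ([0, 1, 1, 1], 1),
    ([1, 1, 1, 0], 1),
    ([0, 0, 1, 1], 1),
    ([0, 0, 0, 0], 0),
    ([0, 0, 1, 0], 0),
    ([0, 1, 0, 0], 0),
    ([0, 0, 0, 0], 0),
]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, features):
    """Estimated probability that a page is phishing."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def train(samples, epochs=5000, learning_rate=0.5):
    """Fit logistic regression with full-batch gradient descent."""
    n_features = len(samples[0][0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for features, label in samples:
            error = predict_proba(weights, bias, features) - label
            grad_b += error
            for j, x in enumerate(features):
                grad_w[j] += error * x
        # Step along the mean gradient over the whole batch.
        weights = [
            w - learning_rate * g / len(samples) for w, g in zip(weights, grad_w)
        ]
        bias -= learning_rate * grad_b / len(samples)
    return weights, bias


def classify(weights, bias, features, threshold=0.5):
    return 1 if predict_proba(weights, bias, features) >= threshold else 0


weights, bias = train(LABELLED)

# Hypothetical feature vectors extracted from freshly scraped pages.
SCRAPED = {
    "page_a": [1, 0, 1, 1],
    "page_b": [0, 0, 0, 0],
}
predictions = {
    name: classify(weights, bias, feats) for name, feats in SCRAPED.items()
}
```

The scraped pages play the role of unseen test data: the model never saw `page_a` during training, so its prediction gives a first hint of how the model generalises.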

Aleksandras Šulženko
Product Owner at Oxylabs
