Our Global Sensor Network vs Web Scraping

If you’ve purchased technology data before, you may be familiar with “web scraping,” a method commonly used to determine the SaaS applications running on a particular website.

Many SaaS products have a unique tag or code snippet that must be added to a site's code for the software to run.

By looking for these code snippets across millions of domains, companies like BuiltWith and SimilarTech can determine the products installed on a given website.
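Snippet detection of this kind can be sketched in a few lines of Python. The signature patterns below are hypothetical illustrations, not the actual rules BuiltWith or SimilarTech use; the point is that matching a pattern in page source is a yes/no check.

```python
import re

# Hypothetical signature list: each product is assumed to be identifiable
# by a code snippet its vendor requires customers to embed on the page.
SIGNATURES = {
    "Google Analytics": r"www\.googletagmanager\.com/gtag/js",
    "Intercom": r"widget\.intercom\.io",
    "Stripe": r"js\.stripe\.com",
}

def detect_products(html: str) -> list[str]:
    """Return the products whose snippet pattern appears in the page source."""
    return [name for name, pattern in SIGNATURES.items()
            if re.search(pattern, html)]

# The result is binary: a product is either present or absent in the HTML.
# Nothing here reveals spend, traffic, or depth of usage.
page = '<script src="https://js.stripe.com/v3/"></script>'
print(detect_products(page))  # ['Stripe']
```

Note that the function only sees what is in the fetched HTML, which is exactly why the limitations below apply.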

This methodology has several significant limitations; let's cover four big ones below:

Scraping is binary. 

From our previous example, web crawling may tell you that intricately.com uses Heroku for hosting, but it can't tell you how much we're spending or how we use the product.

Larger businesses use multiple hosting providers to power their websites, apps, APIs, and more, so understanding these relationships is extremely important.

Scraping leads to false positives.

Adding code to a website is easy, and there is minimal performance cost to leaving it there. 

Many sites carry code that is no longer active, or code that was added but never utilized, so a scraper can't tell you which SaaS applications a company is actually using.

Scraping is domain-based.

Take a company like Nike, which operates hundreds of domains around the world. Web scrapers will treat each of those domains as a distinct entity, inflating the count of deployments and giving a false sense of usage and breadth.

Scraping misses many products.

Web scraping is limited to products installed on websites. Many providers, such as Google Cloud, Amazon Web Services, and Neustar, do not require code snippets because they operate behind the web server.

To identify these products and how much a business is spending, you need to approach the problem from a whole new angle.
