Web Data Scraping

The initial layer compiles an exhaustive list of new and existing crypto projects. This involves parsing data from various sources, both centralized, like CoinMarketCap, and decentralized, such as newly created launchpad pools.
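
As a rough illustration, the sketch below shows how a first-layer collector might poll a centralized listings endpoint for newly added projects. The CoinMarketCap endpoint, API-key header, and response fields are assumptions made for the example, not a description of our exact configuration.

```ts
// Sketch: first-layer aggregation of project candidates (endpoint and fields are assumptions).
interface ProjectCandidate {
  name: string;
  symbol: string;
  source: "coinmarketcap" | "launchpad";
  discoveredAt: Date;
}

// Poll a centralized listings API for recently added projects (Node 18+ global fetch).
async function fetchCentralizedListings(apiKey: string): Promise<ProjectCandidate[]> {
  const res = await fetch(
    "https://pro-api.coinmarketcap.com/v1/cryptocurrency/listings/latest?limit=100",
    { headers: { "X-CMC_PRO_API_KEY": apiKey } },
  );
  if (!res.ok) throw new Error(`Listings request failed: ${res.status}`);
  const body = (await res.json()) as { data: { name: string; symbol: string }[] };
  return body.data.map((d) => ({
    name: d.name,
    symbol: d.symbol,
    source: "coinmarketcap" as const,
    discoveredAt: new Date(),
  }));
}
```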

The first filtering stage employs complex regular expressions to extract datasets containing the essential information needed for accurate further analysis. We have developed programs in Node.js (TypeScript) that simulate a web browser environment to accomplish this. These programs navigate various website sources, seeking out new projects in real time. They also connect via RPC to blockchain smart contracts, such as Ethereum pre-sale launchpad contracts, which sometimes expose new or additional textual information that we extract. Once we have gathered the essential information for a new project, it is stored in our MongoDB database and then passed to the second layer of the web data processor. With the foundational data established (the information the project team wants public), we delve deeper, extracting details such as community information.
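
The sketch below illustrates part of this flow under stated assumptions: it reads textual metadata from a hypothetical pre-sale contract over RPC using ethers.js, applies a simple regex check, and upserts the result into MongoDB. The contract address, ABI, and description() getter are illustrative, and the regex shown is far simpler than the expressions used in the real pipeline.

```ts
// Sketch: pulling text from a pre-sale contract over RPC and persisting it to MongoDB.
// The ABI below is hypothetical; real launchpad contracts expose different getters.
import { ethers } from "ethers";
import { MongoClient } from "mongodb";

const PRESALE_ABI = [
  "function name() view returns (string)",
  "function description() view returns (string)", // hypothetical metadata getter
];

async function ingestPresale(rpcUrl: string, presaleAddress: string, mongoUri: string) {
  const provider = new ethers.JsonRpcProvider(rpcUrl);
  const presale = new ethers.Contract(presaleAddress, PRESALE_ABI, provider);

  // Read whatever textual metadata the contract exposes.
  const [name, description] = await Promise.all([
    presale.getFunction("name")(),
    presale.getFunction("description")(),
  ]);

  // Minimal regex filter: keep only records whose name looks like a plausible project name.
  if (!/^[\w .-]{2,64}$/.test(name)) return;

  const client = new MongoClient(mongoUri);
  try {
    await client.connect();
    await client.db("scraper").collection("projects").updateOne(
      { presaleAddress },
      { $set: { name, description, updatedAt: new Date() } },
      { upsert: true }, // one record per contract, updated in place
    );
  } finally {
    await client.close();
  }
}
```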

The second layer then goes directly to the project’s website, collecting all text and image content from every page, along with any documentation, GitBooks, whitepapers, and team details.
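
A minimal sketch of such a crawl follows. The page limit, same-origin rule, and tag-stripping regexes are illustrative simplifications rather than a description of the production crawler.

```ts
// Sketch: breadth-first crawl of a project's website, collecting the visible text of each page.
async function crawlSite(rootUrl: string, maxPages = 50): Promise<Map<string, string>> {
  const origin = new URL(rootUrl).origin;
  const queue: string[] = [rootUrl];
  const visited = new Set<string>();
  const pages = new Map<string, string>(); // url -> extracted text

  while (queue.length > 0 && pages.size < maxPages) {
    const url = queue.shift()!;
    if (visited.has(url)) continue;
    visited.add(url);

    const res = await fetch(url);
    if (!res.ok) continue;
    const html = await res.text();

    // Strip scripts, styles, and tags to keep only textual content (a crude stand-in for real parsing).
    const text = html
      .replace(/<script[\s\S]*?<\/script>/gi, "")
      .replace(/<style[\s\S]*?<\/style>/gi, "")
      .replace(/<[^>]+>/g, " ")
      .replace(/\s+/g, " ")
      .trim();
    pages.set(url, text);

    // Follow same-origin links to documentation, whitepaper, and team pages.
    for (const match of html.matchAll(/href="([^"#]+)"/g)) {
      const next = new URL(match[1], url).toString();
      if (next.startsWith(origin) && !visited.has(next)) queue.push(next);
    }
  }
  return pages;
}
```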

We temporarily store all collected information on our servers, then apply the Levenshtein distance metric to verify its accuracy and uniqueness, avoiding duplication of any data already present.
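
The sketch below shows the kind of check this involves: a standard single-row Levenshtein implementation plus a similarity threshold for rejecting near-duplicates. The 0.9 threshold is an assumed value for illustration, not our production setting.

```ts
// Sketch: Levenshtein-based deduplication of scraped text snippets.
function levenshtein(a: string, b: string): number {
  // Single-row dynamic programming: dp[j] holds the distance for the current prefix of `a`.
  const dp: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    let prev = dp[0];
    dp[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const tmp = dp[j];
      dp[j] = Math.min(
        dp[j] + 1,                              // deletion
        dp[j - 1] + 1,                          // insertion
        prev + (a[i - 1] === b[j - 1] ? 0 : 1), // substitution
      );
      prev = tmp;
    }
  }
  return dp[b.length];
}

// A new snippet is discarded if it is nearly identical to anything already stored.
function isDuplicate(candidate: string, stored: string[], threshold = 0.9): boolean {
  return stored.some((existing) => {
    const distance = levenshtein(candidate, existing);
    const similarity = 1 - distance / Math.max(candidate.length, existing.length, 1);
    return similarity >= threshold;
  });
}
```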

Additionally, the roadmap is thoroughly analyzed to extract all possible milestones, assessing the ambition of the team’s vision. The token economics are also carefully reviewed, covering aspects such as whether an associated token exists, its utility, and the circulating supply at the Token Generation Event (TGE), among other factors. We aim to extend our web data analysis to scrutinize the GitHub source code in the project’s public repository: by submitting code samples to AI models, we intend to detect whether a project replicates others and to ascertain whether a certain quality standard has been met.
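
As an indication of what this stage produces, the sketch below shows one possible shape for the resulting structured record, covering roadmap milestones, token economics, and the planned code-originality assessment. Field names and optionality are illustrative, not a fixed schema.

```ts
// Sketch: illustrative output record for the analysis stage (names are assumptions).
interface RoadmapMilestone {
  title: string;
  targetDate?: string; // e.g. "Q3 2025", when the roadmap states one
  description: string;
}

interface TokenEconomics {
  hasToken: boolean;
  utility?: string;                // what the token is used for, if described
  circulatingSupplyAtTGE?: number; // circulating supply at the Token Generation Event
  totalSupply?: number;
}

interface ProjectAnalysis {
  projectId: string;             // reference to the record created by the first layer
  milestones: RoadmapMilestone[];
  tokenomics: TokenEconomics;
  githubRepo?: string;           // public repository URL, when available
  codeOriginalityScore?: number; // 0..1, planned output of the AI-assisted code review
}
```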
