Discovering Scammer Networks with Machine Learning

re:doubt. TVM Data for Everyone
Jun 14, 2023

Community Spotlight: This article was guest authored by Senior Data Scientist, Carlos Salort.

Machine learning, and deep learning in particular, is steadily gaining relevance as a tool for threat detection.

In this blog post, we present our latest bot. Its main idea is to use the transfers made between addresses to find addresses that belong to scammers but have not yet been flagged as such. Addresses that are already labelled serve as the base for the propagation, so new addresses can be flagged before they are involved in malicious activity. We will explain this new approach in general terms, without going too deep into the technical details.

The scammer label propagation bot relies on two techniques: graph neural networks, a type of deep learning model that uses the additional information found in graphs to improve its predictive capabilities; and semi-supervised learning, a type of machine learning used when not every training observation has a label.

Graph Neural Networks

To use graph neural networks, we first need a graph. This bot generates one around a known scammer: once a scammer is labelled with a high level of confidence, the bot tries to find all the other addresses that potentially belong to the same scammer.

Once the scammer has been labelled, the first step is to obtain all the addresses that have transacted with the central scammer. In the first version of this bot, a transaction covers only direct ETH transfers and ERC20 token transfers. For each of the addresses found, we collect a summary of its complete transaction history. This includes the number of transactions in and out, the average value, the maximum value and the total value, for both ETH and ERC20 tokens. Each address becomes a node in our graph, and its summary is what we use as node features.
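The per-address summary described above can be sketched in a few lines of Python. This is only an illustration: the `Transfer` record, its field names, the `summarise` and `node_features` helpers and the sample addresses are all hypothetical, and splitting every statistic by direction (in/out) as well as by asset is an assumption rather than the bot's confirmed schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Transfer:
    # Hypothetical record layout, not the bot's real schema.
    sender: str
    receiver: str
    value: float
    asset: str  # "ETH" or "ERC20"


def summarise(values):
    """Return count, total, average and max for a list of transfer values."""
    if not values:
        return {"count": 0, "total": 0.0, "avg": 0.0, "max": 0.0}
    total = sum(values)
    return {"count": len(values), "total": total,
            "avg": total / len(values), "max": max(values)}


def node_features(address, transfers):
    """Summarise the full history of one address, by direction and asset."""
    buckets = defaultdict(list)
    for t in transfers:
        if t.sender == address:
            buckets[("out", t.asset)].append(t.value)
        if t.receiver == address:
            buckets[("in", t.asset)].append(t.value)
    return {
        f"{direction}_{asset}": summarise(buckets[(direction, asset)])
        for direction in ("in", "out")
        for asset in ("ETH", "ERC20")
    }


# Made-up history for a single address.
history = [
    Transfer("0xA", "0xScam", 1.0, "ETH"),
    Transfer("0xA", "0xScam", 3.0, "ETH"),
    Transfer("0xB", "0xA", 50.0, "ERC20"),
]
features_a = node_features("0xA", history)
```

Flattening the four statistics across both directions and both assets would give a 16-value feature vector per node under these assumptions.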

After compiling the complete list of nodes, we need to obtain the edges between them. We use a directed graph, that is, the origin and destination of each edge matter. This is important for understanding who initiates a transaction and who receives it. We collect all the ETH and ERC20 transactions between any two nodes in our graph. Every node is connected to the central scammer, but there will also be transactions between the other nodes. As with the nodes, we collect some information about each edge. Instead of a summary of a whole address, we get a summary of all the transactions that make up that particular edge. For example, if there is an edge going from node A to node D, we collect the number of transactions and the total, average and maximum value sent from A to D. We call this data edge features.
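As a rough sketch of the edge step, the snippet below groups transfers by their (sender, receiver) pair, so A to D and D to A become two separate directed edges. The `edge_features` helper, the tuple layout and the sample values are assumptions made for this example only.

```python
from collections import defaultdict


def edge_features(transfers, nodes):
    """Aggregate transfers into one directed edge per (sender, receiver) pair.

    `transfers` is an iterable of (sender, receiver, value) tuples. Only
    transfers whose two endpoints are both in `nodes` are kept.
    """
    grouped = defaultdict(list)
    for sender, receiver, value in transfers:
        if sender in nodes and receiver in nodes:
            grouped[(sender, receiver)].append(value)
    return {
        edge: {
            "count": len(values),
            "total": sum(values),
            "avg": sum(values) / len(values),
            "max": max(values),
        }
        for edge, values in grouped.items()
    }


nodes = {"A", "D", "S"}  # "S" stands for the central scammer
transfers = [
    ("A", "D", 2.0),
    ("A", "D", 4.0),
    ("D", "A", 1.0),   # opposite direction, so a separate edge
    ("A", "S", 10.0),
    ("A", "X", 99.0),  # "X" is not a node in the graph, so this is dropped
]
edges = edge_features(transfers, nodes)
```

Keying the dictionary on the ordered pair is what preserves direction: swapping sender and receiver yields a different key and therefore a different edge.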

We use the node features and the edge features as the input for our model. For this application, we use a custom graph neural network. The model consists of two TransformerConvolution layers followed by two dense layers. It is implemented in Python, using PyTorch, with PyTorch Geometric for the graph layers. The architecture of the model looks as follows:

We have now prepared the input data and the model that will make the predictions, but a critical component is still missing: the labels of the addresses. Without them, the model can’t learn to differentiate between victims and attackers.

Semi-supervised learning

Once we have the list of nodes and edges, we will use the labels to train the model. These labels are generated by other bots in the network. After querying the labels, the graph will look like this:

We can see that five of the nodes have labels. In a real graph, the proportion of labelled nodes would be much lower. This presents a problem: our model can’t be trained with standard supervised learning, in which a model is trained on a fully labelled dataset and then predicts previously unseen data.

To bypass this problem, we use semi-supervised learning. In this case, we train the model using the small subset of labels that we have. With the information gained from that data, the model then makes predictions for the remaining nodes of the graph, effectively transferring what it learned about the relations between known nodes to the rest.
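A common way to implement this kind of training is to compute the loss only on the labelled nodes and then predict for every node. The sketch below shows that masking idea with plain PyTorch. To stay self-contained it uses a single linear layer as a stand-in for the graph model, so it ignores the edges entirely. The random features, the label encoding (-1 for unlabelled, 0 for victim, 1 for attacker) and the hyperparameters are all assumptions.

```python
import torch
import torch.nn.functional as F
from torch import nn

torch.manual_seed(0)

num_nodes, num_features = 10, 4
x = torch.randn(num_nodes, num_features)  # made-up node features

# -1 marks an unlabelled node; 0 = victim, 1 = attacker (hypothetical encoding).
labels = torch.tensor([1, 0, -1, -1, 1, -1, 0, -1, -1, 0])
labelled = labels >= 0  # boolean mask selecting the five labelled nodes

model = nn.Linear(num_features, 2)  # stand-in for the graph neural network
optimizer = torch.optim.Adam(model.parameters(), lr=0.05)

losses = []
for _ in range(100):
    optimizer.zero_grad()
    logits = model(x)  # predictions for every node
    # Only labelled nodes contribute to the loss.
    loss = F.cross_entropy(logits[labelled], labels[labelled])
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

with torch.no_grad():
    predictions = model(x).argmax(dim=1)

# Predicted classes for the nodes that had no label during training.
new_predictions = predictions[~labelled]
```

With a real graph model, the unlabelled nodes would still take part in the forward pass through their edges, which is how information about labelled neighbours reaches them even though they never appear in the loss.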

After the training process, the model is able to generate predictions. The final graph looks like the following image, where the nodes drawn with dashed outlines are the predictions. This model would mark two new attackers: node E (because of how similar it is to node J, a labelled attacker) and node H (similar to D).

Conclusion

In this post, we saw how more advanced deep learning algorithms can help defend against threats, even before they happen, just by analyzing the relations between addresses.

Join our community on Telegram or Twitter to stay up to date with the latest news, announcements, and exciting developments. Connect with like-minded individuals, engage in discussions, and be part of a growing community shaping the safe future of TON.

Stay tuned and be among the first to know when we go live!

re:doubt is a powerful tool for TVM blockchain research, complete with all the tools you need to discover, explore, and visualize vast amounts of blockchain data.