Thomas de Marchin & Milana Filatenkova: Data Science on Blockchain – NFT Analysis

Thomas & Milana are both data scientists who have been analysing blockchain data for interesting insights about NFTs. We discuss their process and what they recommend for others conducting this type of research for the very first time.

You can read more about their research below.

NFT Analysis Articles:

Data Science on Blockchain with R. Part I: Reading the blockchain

Data Science on Blockchain with R. Part II: Tracking the NFTs

Helium Article:

Data Science on Blockchain with R, Part III: Helium-based IoT is taking over the world

You can follow the guys here for upcoming articles and research

LinkedIn – https://www.linkedin.com/in/tdemarchin/ https://www.linkedin.com/in/mfilatenkova/

Twitter – https://twitter.com/tdemarchin

Medium – https://tdemarchin.medium.com/

The following is a rough transcript which has not been revised by Ocean Missions. Please check with us before using any quotations from this transcript. Thank you.

[00:00:00] Scott: Today we’re going to be speaking with Thomas and Milana. They’re going to talk to us about some of the on-chain data analysis work that they did and step us through some of the pros and cons of the approaches they went for, and maybe help you with your own experience of working with on-chain data for the first time.

So thank you for joining us on the podcast today, guys.

[00:00:29] Milana: Thanks for having us.

[00:00:30] Thomas: Thanks Scott. It’s a pleasure to be here.

[00:00:33] Scott: Well, I was wondering if we could start with you, Thomas. Could you tell us a bit about yourself and how you got interested in working with blockchain data?

[00:00:44] Thomas: Yes. So we are both data scientists, and we work as consultants for the pharmaceutical industry. I personally discovered the world of blockchain about five or six years ago. My mother-in-law had invested in a strange thing called Bitcoin. I looked at it and found it really cool. Then I invested a bit. I think when I bought my first Bitcoin, they were about $500 and it was 20, something like that.

Then I quickly got obsessed with the technology itself, the blockchain technology. I read a lot about it, trying to understand how it works. And to be honest, it was not immediately clear to me that blockchain offered a vast field of discovery for data scientists like us.

I initially just saw it as a ledger for financial transactions. I think it’s only two or three years ago that I realized that blockchain is in fact much more than that, and that there is a massive amount of interesting data just waiting to be discovered. So I think that’s it for me.

[00:01:51] Scott: Awesome. And Milana, can you tell us a bit about yourself and how you got interested in working with blockchain data?

[00:01:58] Milana: I’m also a data scientist, and I work together with Thomas at a company called PharmaLex. PharmaLex is a consulting company that provides analytical services to the pharmaceutical industry. For example, we help clients design studies and also analyze their data, to help them make the most informed decisions from a strategic viewpoint.

A few years ago I found out about blockchain and I got absolutely obsessed with it. I think it is a revolutionary technology that opens the door to a fairer new world, because it has the potential to liberate us from these parasitic centralized intermediaries.

It is only recently, though, thanks to Thomas, that I had an opportunity to get hands-on experience analyzing actual on-chain data.

[00:02:49] Scott: Very cool. And this really is how I originally connected with Thomas: through this work that you’re talking about.

I came across an article that was published in the Towards Data Science publication on Medium. It was basically a summary of the work that the two of you were doing. So I was wondering if you could give those listening a quick overview of what that project was and what it was looking at.

[00:03:16] Milana: In fact, we have two articles. In the first one we set out simply to read on-chain data. To get the data from the blockchain, first of all, you would need to set up an ETL server. ETL stands for extract, transform and load. Setting up an ETL server takes quite a few steps before you can actually get started with the on-chain analysis itself.

It can get quite expensive too, since you would need high-performance hard drives to store the blockchain database. But fortunately for us, we discovered that there exist services out there that can provide us with a shortcut. Those are Etherscan, for example, or one that appeared more recently, called The Graph.

What they offer is API functionality to ease the process of retrieving the data in a format that can be easily processed by statistical software such as R. So in the first article, we describe how to use an API to download NFT transaction data from an NFT market.

In our case, we used OpenSea as the NFT market. We also show some basic exploratory analysis that can be performed on the on-chain data. In our case we made a few simple graphs containing a summary of the transaction prices. So that is it for the first article, and Thomas is going to talk a little bit about the second one.

[00:04:50] Thomas: Thanks, Milana. Yeah, so in the second article we focused more specifically on tracking NFTs. Unlike Ethereum, Bitcoin or other regular cryptocurrencies or money-related tokens, NFTs are unique and you can actually track each of them individually. And that’s what we did. We decided to track the Weird Whales collection. The Weird Whales are a bit like the CryptoPunks, the famous CryptoPunks, but with little tiny whales, very cute. There are a total of a bit more than 3,000 whales, and each of them has a different combination of attributes: different colours or accessories; some smoke cigarettes.

We chose that collection because the story behind it is nice. It was actually created by a 12-year-old boy who started programming when he was five. That little boy made a buzz with his NFTs, as well as half a million dollars in two months.

Yes, it was quite profitable. But coming back to the article, the first part is about retrieving the data. Again, you will use Etherscan, so using the API. But I have to say it was way more technical than for the first article, because it involved doing a bit of reverse engineering to understand the OpenSea smart contract.

You can read the smart contract, which is stored on the chain, but you need to understand it to be able to know which data you need and how to extract them. Then, once you have the data, you have to visualize it.

And that’s very challenging, because you have so many NFTs, more than 3,000 just for this collection, and so many exchanges between so many wallets. You have a lot of data. So we used network analysis tools to make fancy animated network plots. I think that visualization of blockchain data is very challenging because of this huge amount of data.

If you try to plot all the transaction data, you just get noisy, unreadable plots. That’s why, when you analyze blockchain data, you want to work on a subset of the data or find effective ways to summarize it.
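Editor's illustration: one way to work on a subset, as described above, is to collapse individual transfers into weighted wallet-to-wallet edges and keep only the most active wallets before plotting. The following Python sketch is an assumption for illustration only; the articles themselves use R, and the function name, record layout and wallet labels here are hypothetical.

```python
from collections import Counter

def summarize_transfers(transfers, top_n):
    """Collapse (sender, receiver, token_id) transfers into weighted edges,
    keeping only edges between the top_n most active wallets."""
    # Count how many transfers each wallet takes part in.
    activity = Counter()
    for sender, receiver, _token_id in transfers:
        activity[sender] += 1
        activity[receiver] += 1
    keep = {wallet for wallet, _count in activity.most_common(top_n)}
    # One weighted edge per (sender, receiver) pair among the kept wallets.
    return Counter(
        (sender, receiver)
        for sender, receiver, _token_id in transfers
        if sender in keep and receiver in keep
    )

# Made-up transfers: wallets A-D and token ids are placeholders.
transfers = [
    ("A", "B", 1),
    ("A", "B", 2),
    ("B", "C", 1),
    ("A", "C", 3),
    ("D", "A", 4),
]
edges = summarize_transfers(transfers, top_n=2)
```

The resulting edge weights could then be passed to a network plotting tool, so that the plot shows a few thick edges instead of thousands of overlapping arrows.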

[00:07:11] Scott: So you mentioned the ETL and sending the data to a database. When you started using the API, say either the Etherscan API or The Graph, which has obviously become a bit more popular since then, are you still having to write a bespoke sort of ETL script?

Or is that functionality more built into some of those services?

[00:07:41] Thomas: Well, no. We didn’t set up the ETL, because the ETL is a complete server, right? You need to set up a node which will listen to the blockchain, get all the data and store it in a huge database.

So that can be quite complicated. But with services like Etherscan, The Graph, and there are many others, it’s easy. You just query the API and say, for example, I want to list the transactions from this address between that date and that date, and the server will send you back the data in a nicely formatted way.
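Editor's illustration: the kind of request Thomas describes (transactions for one address within a date range) might look like the Python sketch below. The articles use R; Python is swapped in here. The endpoint and parameter names are assumptions modelled on Etherscan's account API and should be checked against the current documentation; the sample records are made up, and no network call is made.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

# Assumed endpoint and parameter names; verify before use.
BASE_URL = "https://api.etherscan.io/api"

def build_txlist_url(address, api_key):
    """Build a query URL asking for the transactions of one address."""
    params = {
        "module": "account",
        "action": "txlist",
        "address": address,
        "sort": "asc",
        "apikey": api_key,
    }
    return BASE_URL + "?" + urlencode(params)

def filter_by_date(transactions, start, end):
    """Keep records whose Unix 'timeStamp' falls in [start, end)."""
    lo, hi = start.timestamp(), end.timestamp()
    return [tx for tx in transactions if lo <= int(tx["timeStamp"]) < hi]

# Hypothetical records shaped like an API response (values are made up).
sample = [
    {"hash": "0xaaa", "timeStamp": "1626739200", "value": "100"},  # 2021-07-20 UTC
    {"hash": "0xbbb", "timeStamp": "1627776000", "value": "250"},  # 2021-08-01 UTC
    {"hash": "0xccc", "timeStamp": "1630454400", "value": "400"},  # 2021-09-01 UTC
]
start = datetime(2021, 7, 25, tzinfo=timezone.utc)
end = datetime(2021, 8, 15, tzinfo=timezone.utc)
selected = filter_by_date(sample, start, end)
```

Filtering on the timestamp after download is one simple design choice; a service may also let you restrict the range in the query itself.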

[00:08:18] Scott: So, you know, there’ll be people listening who are starting their own projects like this.

I was wondering if you could rewind the clock to when you were first thinking about this project, and tell us about some of the things you were experiencing at the time, or what led you to begin working on it.

[00:08:35] Thomas: I would say that if you are an experienced data scientist and have a basic understanding of how blockchain technology works, you already have the right background to start a project like this. If you have no prior experience in data science, you would first need to learn a basic stat language.

We use R; you can also use Python. There are others, but R and Python are the main programming languages for data science. It’s not that difficult, but there is a learning curve if you have no prior knowledge of programming. Now, regarding what led me to work on this project, the one we described on following the NFT transactions:

one day I simply got curious about how on earth one gets to read the blockchain data, and it just kicked off from there.

[00:09:29] Scott: Cool. And so when you started to work on the project, what were the first steps you took? And were there any mistakes along the way that stood out to you?

[00:09:40] Milana: In my opinion, in the beginning it is very important to start by setting a specific question that you want to answer and, on the way towards answering it, to avoid getting trapped in endless tuning of tiny details, because this can get tiring and frustrating, and in the end will discourage you from moving forward with the question.

For instance, I remember spending hours trying to add little directional arrows to a transaction graph, only to finally realize that the whole plot was of no use. That is why my recommendation to our listeners would be to always have a final goal and an approach towards achieving it, maybe even written down somewhere, because that would help you avoid getting stuck in unimportant details and focus on the goal that you are trying to achieve with your project.

[00:10:39] Scott: It’s interesting. It’s not too dissimilar to normal software development, where if you don’t have a clear goal or intention in mind, you can easily get led down many different roads.

[00:10:52] Thomas: Yeah. You can waste days on a useless thing, but that’s not specific to blockchain. You’re right.

It’s common to all data science projects, I think.

[00:11:04] Milana: Yeah, but this was a completely new field for us, with so much to be discovered. So you can easily get diverted into going in every direction. But it is important to have this focus, to know what you’re trying to achieve in the end. In our case, it was to try to see what the patterns are in the NFT transaction data.

And we will get to that later.

[00:11:31] Scott: And so did you guys write down your question? Or how did you stay focused on the...

[00:11:41] Thomas: Yeah, I mean, we didn’t really write the question down, but it would be a good thing to do next time.

[00:11:46] Milana: Yeah. But we are not software developers. We are data scientists, and software development is still quite new to us.

So we still have a lot to learn there.

[00:11:59] Thomas: Well, not for me, because I did a lot of software development.

[00:12:02] Milana: Okay, yes, Thomas used to be in software development.

[00:12:06] Thomas: Well, I ended up writing the software by myself, but I was leading the team, and I still continue to develop interactive web applications on a regular basis.

[00:12:20] Milana: Okay. So to me, that’s just

[00:12:24] Scott: Cool. So you sort of touched on some of these key pieces, right? Looking at getting the API and structuring the data, and making sure you have a question in mind that you’re targeting. But are there some key stages that you would break it down into that we haven’t already covered?

[00:12:47] Thomas: Well, if I had to summarize the main steps, I would say there are three, and that’s no different from other data science projects; it’s not specific to blockchain. The first part is getting the data and cleaning them. Getting the data can take quite some time the first time you work with blockchain, to get used to these APIs or to set up your ETL.

Once you have done it once, it will maybe be easier the next time. Once you have the data, of course, the second part is to perform the analysis, and the third part is to write a report or an article to describe what you did, just as you would with any other data science project. Now, regarding the tools we use:

I think we mentioned it briefly, but for data scientists there are two main languages, R and Python. We use only R because we are used to it. That’s the language we use on a daily basis for work. And for those who are not familiar with it, it’s a functional programming language extensively used by statisticians and data scientists, as I mentioned.

When it comes to data analysis, it offers endless possibilities. You can do everything with it: data management, plotting, modeling, web publication, e-book writing. It’s open source and easy to learn. And there is a big, huge community which has developed thousands and thousands of packages to enhance the functionality.

So whatever analysis you want to perform, it’s very likely that someone has already developed a package for it. As I mentioned, we did everything with R: downloading the data using the API, doing the data management, the analysis of course, and even the reporting. We write the article using R Markdown to produce a nice HTML report. Finally, once we are done, we just push everything to GitHub.

So it’s saved somewhere in the cloud and it can be used by the community.

[00:14:58] Scott: So you obviously got the data and then started to do some analysis on it. Can you maybe share some of the discoveries that you made once you started to do some of the analysis work?

[00:15:12] Milana: So, for example, when we looked at the Weird Whales NFT transaction data, we discovered that there are some interesting patterns in the evolution of NFT prices over time, and those can be linked to social media attention to Weird Whales. Straight after the NFTs are minted (minting is the event of creating a new NFT on the blockchain),

there is a period of intense price growth. Then a quieter period follows, when the prices drop, followed again by a price increase. And the latter is clearly associated with intense activity on social media around the Weird Whales collection. So this is what is very interesting.

There are some cycles in the data. You see ups and downs, and those can be linked back to what is happening behind the scenes on social media. But of course this pattern is very specific to the Weird Whales collection, and other collections may show different behavior.

We don’t know, because we haven’t looked at other collections. But regarding the Weird Whales one, yes, this is the pattern that we observed: the ups and downs are linked to the social media attention to the collection. I was also surprised by the number of transactions that we retrieved from the blockchain,

even though we looked at only one single NFT collection, which was a recent one, by the way. As Thomas mentioned, it is very challenging to visualize so much data, and this is a field of discovery. There are lots of opportunities in terms of developing new visualization tools that are tailored to on-chain data.
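Editor's illustration: price cycles of the kind Milana describes are easier to see once individual sales are summarized per day. The Python sketch below computes a daily median sale price; it is an assumption for illustration, not the authors' R code, and the timestamps and prices are made up.

```python
from collections import defaultdict
from datetime import datetime, timezone
from statistics import median

def daily_median_price(sales):
    """Group (unix_timestamp, price) sales by UTC day and return
    a dict mapping each day to its median price, in date order."""
    by_day = defaultdict(list)
    for timestamp, price in sales:
        day = datetime.fromtimestamp(timestamp, tz=timezone.utc).date()
        by_day[day].append(price)
    return {day: median(prices) for day, prices in sorted(by_day.items())}

# Made-up sales: three on 2021-07-20 and three on 2021-07-21 (UTC).
sales = [
    (1626739200, 0.02),
    (1626742800, 0.04),
    (1626746400, 0.03),
    (1626825600, 0.10),
    (1626829200, 0.30),
    (1626832800, 0.20),
]
medians = daily_median_price(sales)
```

A median is used here rather than a mean so that a single unusually expensive sale does not dominate the daily summary; that choice is the editor's, not something stated in the interview.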

[00:17:00] Scott: And what has been the response since you published the article? Have you been surprised by the response you’ve gotten?

[00:17:08] Thomas: Well, I think we can say it’s been a success.

We received many questions and positive feedback. We also get new followers every day, which is, I think, a good sign. And just to give a comparison, previously we had written a few articles on pharmaceutical processes. That’s the domain in which we do our day-to-day work. But I have to say that the number of views for the two blockchain-related articles is orders of magnitude above that of the ones about pharmaceutical processes.

So that’s interesting.

[00:17:45] Scott: So yeah, people are more interested in NFTs than pharmaceutical processes. Who would have thought?

[00:17:51] Thomas: Yeah, yeah. It’s good that there’s this interest in doing data science on blockchain.

[00:17:57] Milana: Yeah. We see clearly that there is a lot of attention to our articles. Even though, again, we are not specialists, we are kind of amateurs at this stage, already now our work attracts so much attention. It’s incredible.

[00:18:16] Thomas: Yeah. And you can also see that there are more and more startups working on data science and on-chain analysis. And that’s clear.

[00:18:26] Milana: That’s the future.

[00:18:28] Scott: Very cool. And so are you guys looking to do some more blockchain data analytics projects?

[00:18:35] Thomas: Yes. At the moment we are looking at the Helium blockchain. For those who don’t know that blockchain, very briefly, it’s a blockchain that leverages a decentralized global network of hotspots. Let me explain: a hotspot is a sort of modem with an antenna, and it provides long-range connectivity

between wireless devices that we often call IoT devices, for Internet of Things. These devices can be environmental sensors to monitor the quality of the air, for example. They can also be localization sensors, to track bike fleets. So there are many applications for these little devices.

Helium is interesting because people are incentivized to install hotspots by earning Helium tokens. They buy a hotspot, they install it, and then they earn money. This is what allows the network to increase its coverage. So that’s a very successful project. I think one or two months ago there were about 500,000 hotspots in the world.

And it’s just growing exponentially. I personally like this project because it has a practical use. It’s not just about finance, like many other blockchain projects. It has real-world applications. I have one or two hotspots two meters away from me. So we are,

[00:20:09] Milana: which is making

[00:20:09] Thomas: money.

So, yeah. That’s quite profitable, I have to say. But coming back to the project, I would say that looking at Helium data is a whole new step. It’s not simple curiosity. I believe people need metrics and statistics to make good decisions, and I hope they will find our work useful. To come back to your question, our goal in the Helium project is to visualize the growth of the network and to quantify its usage.

This time it’s a bit harder to get the data, because to visualize the growth of the network you need, of course, the data from time zero up to now. So you need all the historical data for all the hotspots in the world. That means that basically you need to get the full blockchain, and once it’s loaded in a database, that is several terabytes of data.

So you cannot just rely on an API, send a query and receive a small data table. It’s another order of magnitude, to be clear, the size of this...

[00:21:19] Milana: Another scale.

[00:21:20] Thomas: But stay tuned.

[00:21:23] Scott: With Helium, I assume only a subset of the actual data is being recorded to the blockchain? Or does the Helium node record, you know, the device’s unique ID, time of usage and all of that sort of stuff as well?

[00:21:43] Thomas: So, yeah. You have, obviously, the little hotspots. Each hotspot, each little modem with an antenna, acts as a node. But they don’t store all the blockchain, of course, because it’s terabytes of data. On the other side, you can set up a bigger node which really records all the data.

And these nodes hold the whole database.

[00:22:09] Scott: Yeah, and so those nodes are also picking up the usage stats, for example? So if someone has a device nearby that is going to be leveraging the frequency coming off that aerial, that is all being recorded as well?

So essentially, when you have the historical data, can you start to build a map of all the usage on there? Have you had a chance to get an idea of the level of granularity of the data that you can get from Helium?

[00:22:47] Thomas: Well, we are at a very early stage of this analysis, but yes.

So the blockchain records all the transactions. When a device connects to your hotspot, it will transfer some data. I don’t think this data, what is transferred by the device, is saved on the blockchain. But for sure, what is saved on the blockchain is that there was a transfer of data.

And of course this transfer generates a reward in terms of Helium tokens, and that is saved on the blockchain. You also have another layer, which is the coverage of the network. To ensure that your hotspot is useful, it will communicate with other hotspots. If it can communicate with other hotspots, it means that your hotspot is set up in a good place.

And then you’ll be rewarded for having a hotspot in a good place. All these reward transactions are saved on the blockchain. So you can really filter the rewards by type: whether they were earned thanks to good coverage or because your hotspot carried traffic from IoT devices.

[00:23:53] Scott: Very cool. Yeah. So you can actually infer quite a bit just from looking at how the rewards are distributed among the nodes.

[00:24:04] Thomas: Yeah. Another interesting project: there are also people who try to cheat. They say that the hotspot is somewhere on the map while it’s not,

just to get some money. And it’s supposed to be possible to detect these hotspots. So there are a lot of nice things to do with it, nice data science projects there.

[00:24:26] Milana: We’re only scratching the surface now.

[00:24:29] Scott: Yeah. That’s very cool. Well, thanks for discussing this work with us.

And it sounds like you’ve got some pretty interesting stuff to come. If people want to learn more about your work and potentially see future projects, where should they look?

[00:24:44] Thomas: Well, I think the easiest is to follow us on LinkedIn or on Twitter. You can also subscribe to our Medium account so as not to miss the next article.

The address is https://medium.com/@tdemarchin, and tdemarchin is T-D-E-M-A-R-C-H-I-N.

[00:25:07] Scott: Perfect. We’ll have links to those.

[00:25:12] Thomas: Okay, thanks.

[00:25:12] Milana: Thanks for having us. It’s been a pleasure. And I hope we are going to do another podcast in the near future on the Helium data analysis. We are really excited to get started on that one.