Tugce Ozdeger: Datatera - Reducing Bias in AI Models

We explore how Tugce Ozdeger, founder of Datatera, is using Ocean Protocol’s Compute-to-Data feature to reduce bias in AI models in the healthcare industry.

https://www.Datatera.se

https://twitter.com/Datatera21

The following is a rough transcript which has not been revised by Ocean Missions or Tugce Ozdeger. Please check with us before using any quotations from this transcript. Thank you.

[00:00:00] Scott: Why don’t we start off? If you could just tell me a little bit about yourself, how you became aware of Ocean Protocol, and what it was about Ocean that made you want to explore it further?

[00:00:18] Tugce: Yeah. I’m really excited about this talk, actually. Thank you for having me. To introduce myself, I am a senior software engineer and I specialize in .NET and Microsoft technologies. I come from enterprise-level corporations. The way I found out about Ocean Protocol is that I was searching for a technology that could give us the opportunity to make data sets available without compromising any privacy.

Because that was my concern. When I worked with data scientists, they didn’t have access to large samples of data when they were building their models. That caused a lot of problems, because the knowledge is very limited, which means that whatever they produce is very biased.

And this has a very direct impact on the IT systems, which is where I’m coming from, and which also have problems because they are based on those models. I was investigating the reason why the people who own those data sets don’t want to make them available. And it’s very understandable, because there is actually no platform that they really trust, where they know that whatever they give access to will not be breached or exposed.

The people that we are targeting with my project are actually tech people: IT people and data scientists. And the problem we are tackling is actually tech related. When I was searching, I looked at blockchain, because I was really interested in blockchain, but I had been more into the financial aspect of it, not so much the technology perspective.

And that’s how I got to know Ocean Protocol. I saw that this Compute-to-Data technology was released recently, and the reason why they built it is so that the data will never be copied or moved anywhere while we’re running jobs to train AI models and learn from it. The purpose is not exposing data or trying to understand what is inside; it’s more about learning from the patterns and data points to be able to improve the models.

And that was what I was looking for. When I saw Ocean Protocol, I got really excited that the technology is already in place and that it’s open source, meaning that there is an opportunity to learn from their experience and also improve and develop it further. So that’s how I came up with this.

And then I joined Discord and I saw that there is a community. Because I’m coming from the Microsoft side, you can imagine that there is a very big community around us. When I saw that there are quite a few people on Discord and everyone is very helpful, I became really happy, because it’s a new technology.

There are lots of things to learn and also challenges, especially for a person like me running Windows. It has been quite a long time since I worked with Ubuntu or a Linux operating system.

[00:04:20] Scott: So there’s a number of pieces there, for sure.

And it’s really interesting that you spoke with these entities who had some data and had the issue of where to put their data, and they can’t be sure that it isn’t going to be exploited in some way, shape, or form.

So yeah, I definitely see how you landed on Ocean. All of these points address some of the fundamental characteristics and propositions of Ocean. I’m very new to the work you’re doing and your project, and I know that the goal is to reduce bias in AI models.

How did you come across this problem? Was this something that you were attempting to solve in your experience at the enterprise level, or was this more of a passion project that came about? How did you settle on that space and start pursuing it further?

[00:05:30] Tugce: In my working experience as a developer in corporations, for many years I worked closely with BI developers. Recently I also had the opportunity to work with a data scientist, after I quit my job to be a full-time entrepreneur. So as a developer I have experienced for quite a long time how important it is to build maintainable, sustainable IT systems.

And when I worked with data scientists, I understood their struggle to understand the data and really build a model, no matter what kind of vertical they are in. I contribute in the field where they will deliver this model. But they don’t have this access. I was trying to find a data set, data points, that would really help them out, because we were in the same team.

I found some data sets. There are data sets available in public. The thing is that they are incomplete, so they don’t give any clue about the life cycle of the data. When I worked closely with them, what they wanted to see was the whole life cycle, so they really understand the different phases the data travels through, from when it is created to when it is done and everything is complete.

Unfortunately I couldn’t find any such data sets. What I found was incomplete, maybe a portion of the whole journey. Then I really understood that this is actually a problem. And I knew that when they build a biased AI model, this has a very direct impact on the IT system which is based on that model.

Right. And there’s maintenance work, where the system will crash at some point because it’s biased, meaning that it will not work for a particular case. Then this becomes a maintenance job, because they need to fix the model for the new case that they didn’t know about when they built it.

So I understood the way they work, because I know how we work, data scientists and I, when we cooperate: my job has been producing the data, and their job is to try to understand the patterns from the data. So we have different ways of seeing it and contributing. And I understood that it would be better if we give them more resources while they’re building the AI model, rather than reporting a bug that something is not working, so they need to adjust their model and make a release, and then we test it out again. So that was the problem. Then I started to understand it from the other side: why they don’t want to give access to this data.

Then I realized that we are contributing to this field, this technology, but at the same time, because the technology is lacking, in a way we don’t trust it. That’s why we don’t have some kind of platform to give access to these people, so they can learn from it and we can reduce this maintenance side.

They learn and train on the data, and at the same time, on the other side, we don’t have to deal with these back-and-forth cycles of IT maintenance. From my point of view, IT systems need to be sustainable and maintainable. They don’t only have to address today’s challenges or requirements, but also the future’s.

So that is how I actually got acquainted with this challenge. And like I said, I have always had some kind of interest in blockchain, but more in the financial aspect. When I really dove deep into Ocean Protocol and the mission that they have, I really liked it.

And I really see that many people could benefit from this idea, because one of the biggest reasons why bias happens in AI models is the lack of data. We humans are also biased in a way, and we are the ones producing those AI models, often without knowing that we are biased. To get more diverse data, we need to have more data.

And also quality data. The amount doesn’t really matter; it is the quality of the data that really gives you some clues about what more you can achieve or improve in your model. And that is what we promise: the trust that whatever you give us access to, we will never share or expose anywhere, because once it’s on the chain, it’s secured, and your data will not be copied or moved anywhere.

At the same time, with the quality of data that we have access to and deliver to our data scientists, they will also be satisfied, because they really see the result and they understand: okay, I learned something new today; I didn’t know that my model cannot respond to this particular case.

So it’s not just trust. Trust is of course part of it, but at the same time, what we deliver as data, or as the result of that compute job, also gives something to the data scientists. That is also something we promise.

[00:11:48] Scott: So one of the things we connected on was Ocean Protocol’s Compute-to-Data feature, and we’ve alluded to it a few times.

Do you mind just explaining briefly how the Compute-to-Data feature is going to help your use case?

[00:12:06] Tugce: In our use case, the concern of the people that I talk to is about access.

When I pitch this idea, they get really excited, but the critical thing is the access: what kind of access are you talking about? Because the vertical that we are targeting is a very special vertical, meaning that they have very private data and they will never take any risk at all that it will be exposed or land in the hands of people who are not supposed to see it.

So that’s why they really try to understand. What I also feel is that people actually don’t know blockchain well. I mean, I am not an expert by any means, but the people that I talk with are not technical persons, and they feel like blockchain is just a digital token.

They really don’t know that it’s also a technology that can actually serve us if you use it in the right way. So I try to explain to them that when we have this access, it’s just to be able to run this compute job. And this compute job has a mission: to run the algorithm that is provided by the data scientists, their model, and train it on the data that we decided on based on the feedback we get from the data scientists.

Then when we run this job, the algorithm tries to understand the patterns that the data scientists potentially didn’t know about from the data sets they worked with. This will provide a model, and also the logs: everything that happened during that run will also be provided to the data scientists.

They understand that it’s not like sharing data, because when you use that word, everyone gets really frustrated. No, it’s not sharing. It’s giving an opportunity to learn from the data points that you own, so that we can create models that are sustainable and more stable when they are delivered to any IT systems that they choose.

I also feel that the people who own or store that data really don’t know what the most important things in their data are for the data scientists. That is another question.

What kind of data points would the data scientists be interested in when they run their algorithm, and what kind of feedback can they provide us, so that we can see what portion of the data, or what kind of data, could be accessible on the platform? That way the algorithm can really catch the points that will be valuable for data scientists.

So there are lots of discussions about the value we will deliver, and especially about the access part, which the technology of Ocean Protocol delivers for us. I explain how this compute job actually works: when the data set is published, the location and everything is actually encrypted.

And when it is added to the chain, it can never be seen by anyone. I’m trying to explain it as simply as possible, so they really understand that even we have no clue about the data, and we will never see it on our end. The only thing we know about is the content description, the metadata, which we need to know because we need to let data scientists know, so they can decide if it is something valuable for them.

And also where I should order this compute job, so that it will go there, run the job, and then produce results. This is something they really get excited about. I feel they have been looking forward to something like this for many years, because then they feel calmer.

And they ask me to explain more about the things they need in order to trust this technology.
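
The flow described in this answer (the algorithm travels to the data, and only a trained model and the run logs come back) can be pictured with a small toy sketch. This is not Ocean Protocol’s API: the `DataProvider` class, the `run_compute_job` method, and the example rows are all hypothetical, made up purely to illustrate the idea that the raw rows never leave the owner’s side.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ComputeResult:
    """What the data scientist gets back: a model and logs, never raw rows."""
    model: dict
    logs: List[str] = field(default_factory=list)


class DataProvider:
    """Hypothetical data owner (illustration only, not an Ocean Protocol class)."""

    def __init__(self, rows, metadata):
        self._rows = rows          # private: stays on the provider's side
        self.metadata = metadata   # public description data scientists can read

    def run_compute_job(self, algorithm) -> ComputeResult:
        # The algorithm is executed next to the data; only its output leaves.
        logs = [f"job started on {len(self._rows)} rows"]
        model = algorithm(self._rows)
        logs.append("job finished")
        return ComputeResult(model=model, logs=logs)


def mean_threshold_algorithm(rows):
    """Toy 'training': learn the mean of the feature column as a threshold."""
    values = [feature for feature, _label in rows]
    return {"threshold": sum(values) / len(values)}


# Made-up example data; a real provider would hold something like medical images.
provider = DataProvider(
    rows=[(1.0, 0), (2.0, 0), (3.0, 1), (6.0, 1)],
    metadata={"name": "example-scans", "columns": ["feature", "label"]},
)

result = provider.run_compute_job(mean_threshold_algorithm)
print(result.model)  # the trained "model"
print(result.logs)   # what happened during the run
```

In this sketch the data scientist only ever sees `provider.metadata` and the returned `ComputeResult`. A real deployment would also have to vet what an algorithm is allowed to output, which this toy example does not attempt.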

[00:16:48] Scott: With your use case, are you comfortable sharing any more details around the particular use case you’re looking at using this for?

[00:16:59] Tugce: Yeah. I’m speaking to two different companies, and we haven’t decided yet which one, because, you know, this is the holiday season. But it’s the health tech, healthcare industry that we are targeting. Of course healthcare is a very big industry and there are lots of different niches in that industry.

The one that we have been speaking with for some time is a company that delivers medical images. They have quite big customers, and they really feel that this kind of technology, which can give us the opportunity to train those models and make more data sets available in a secure place, is of value for them.

Because, you know, as technology is evolving, we’re producing more data. The more we use AI, the more we use different technologies like machine learning and deep tech. This means that more data will be available, and we will come to a phase where we don’t exchange currencies or money.

We will exchange data. So they really see that this is not just for this year; this is the future we are talking about. And blockchain is also a technology that is targeting the future, though it’s used right now as well. So they understand the value. And what we have been discussing is, like I already mentioned, the access: what kind of access.

And it really depends on how they store data. Is it on-prem? Some companies also have some kind of cloud connection and DevOps. So the way they store the data will also affect what access we ask them for. It also depends on, like I mentioned already, what portion of the data to use, in case they store a very big amount of data: according to the feedback received from the data scientists, what is it that they are more interested in, that will give the biggest value when they train their model?

So there are different options we have been talking about. Maybe they can create some kind of designated server or location where they just give us access to one portion of their data, where we can only train the models.

It depends on how sensitive the data is, what it contains, and what kind of policies they need to follow. So there is also a legal part of this solution that we try to make as easy as possible from our side, from their side, and from the technological perspective as well.

And also how we can monetize this. That is something we are not that focused on right now, but it is something that we will definitely be talking about when everything is in place and working: how we can monetize so that this will be a business both for their side and for the people who would like to consume the jobs.

That’s why I really try to pick the right words. It’s not consuming data; it’s consuming or running jobs. Because sometimes with some words, like sharing data, people get really frustrated. So it’s more like ordering a job to train on the data, to learn from the data.

It’s not sharing data, it’s not downloading data or something like that.

[00:21:03] Scott: Yeah. That’s really useful advice. And you’re right, there are definitely trigger words that sometimes bring up certain defenses when it’s not necessarily what you’re referring to. So recently you got Compute-to-Data working in your local environment.

I was wondering if you could share some of the challenges that you came across or had to overcome in doing that. You’re definitely early in terms of working with Ocean Protocol and with Compute-to-Data. If there’s any advice, guidance, or recommendations that you would have for someone who was looking to do a similar thing themselves, what are some of the things that you would share with them?

[00:21:56] Tugce: Yeah. Like I mentioned at the beginning, I had a very exciting journey to run this in my local environment, because I’m running a Windows machine, and the solution, or at least all the documentation, is for Linux users. So what I did, basically, was to activate WSL on my Windows 11 machine, so that I could have access to a Linux kernel.

WSL is the Windows Subsystem for Linux. Then I tried to understand Kubernetes and minikube. There are lots of libraries that I needed to install first of all, to be able to get my hands dirty and start running some commands. And when I started to install the libraries that I needed, I saw that not everything was working as it is written in the documentation.

There were some other libraries or extensions that I needed to install. Maybe the documentation is not updated, or maybe it’s because I’m using a different setup. So I tried to retrace what I had done.

I tried to understand, first of all, the things that I need to install and how the architecture has actually been built for this solution, I mean Compute-to-Data: what technology I need. For example, I didn’t have Kubernetes on my machine, and then minikube, and then there are more commands that I need to execute.

Then I tried to understand in which order I need to install them. I also asked a lot of questions on Discord, because I was trying to leverage their experience instead of sitting stuck and not knowing what to do. I asked questions like: this is what I have done, but I got this; what should my next step be?

And I didn’t give up, because I knew that people had already implemented this, which means that it works, so it should work on my end. But since I’m using a different operating system, of course there could be some challenges. I spent a lot more time than a usual Linux user potentially would, but the more problems I had, the more deeply I felt I understood the things that I need to do: for example, create a wallet, publish the data set first, get some tokens, and then publish the algorithm. I feel like I got used to these errors, and there were also some waiting times to create those assets on the chain.

It took some time. And like I said, I talked to many different people and got advice, both on the Python side and on operating-system-related problems, like some libraries or things that I was doing. And I ran the compute job and it succeeded.

Then I saw the model, and everything worked, and I had done everything on my Windows machine. I don’t know, maybe I’m the first one, because the guys I talked to, and I really appreciate their time and their help, said that they didn’t know of anyone who had run this on a Windows machine. They also got really curious about my progress.

You know, whether I could make this work on Windows. Yes, I did, using, like I said, the Windows Subsystem for Linux and its Linux kernel. So it worked out, but there were things that were not working as written in the documentation. And I really understand that it’s not meant to be for Windows users.

I got help from the documentation, and I also asked a lot of questions. I never felt like I shouldn’t ask or should try to deal with it by myself, because this is a new technology and this is not my tech stack. Of course I have an understanding of what could go wrong, but I leveraged their experience, because they have already done this and they can spot the error more easily and faster than I can.

So I’m really grateful to everyone in the community that I got help from, for their generosity in giving their time and trying to understand what I was dealing with and what kind of problems I had. So you should know that there are challenges on the way, especially when you change your tech stack and take on a new technology.

[00:27:22] Scott: You may in fact be the first person to get it running on Windows, which would be pretty awesome. So for anyone listening who may want to get more involved with Datatera or learn about what it is that you’re doing:

What’s the best way for people to follow your progress and see more of the work that you’re going to be doing with Ocean?

[00:27:51] Tugce: They can contact me on Discord, and we have a page, www.datatera.se. It’s a Swedish domain code. We are also starting to be as social as we can on social accounts.

So it’s the same handle name datatera.se.

Yeah, so just feel free to get in contact and have a chat and let’s talk and spread this technology with others.

[00:28:26] Scott: Awesome. Thank you so much. I look forward to seeing how things progress from here.

[00:28:32] Tugce: Yeah. Thank you. Thank you very much. Have a nice evening.