Longitude Sound Bytes
Ep 97: Data at Sea and on Shore
Blake Moya
At the intersection of ideas and action, this is Longitude Sound Bytes, where we bring innovative insights from around the world directly to you.
I’m Blake Moya, Longitude fellow from the University of Texas at Austin. Welcome to our Longitudes of Imagination series, where we are exploring the roles of individuals, technologies and research that are helping advance understanding of our oceans!
.
We spoke with the members of the Schmidt Ocean Institute, which is a philanthropic foundation that is enabling scientific expeditions on their research vessel Falkor at no cost to the world’s scientists. As part of the UN’s Ocean Science Decade, they are also contributing to a worldwide effort in mapping the entire ocean floor by 2030.
In today’s episode we are featuring highlights from a conversation I led with Corinne Bassin, Data Solutions Architect at SOI. As a Statistics PhD student, I was interested to hear about how researchers handle the massive amounts of data collected so far out at sea. We started our conversation with her explanation of the kinds of data the research vessel Falkor collects.
.
Corinne Bassin
My academic background, way back when, is a mix of mathematics and oceanography, or science in general. So I have a bachelor’s in math, and a master’s in interdisciplinary marine science. Over my career, I’ve kind of straddled the areas that are more applied mathematics, and Earth Science, both in academic institutions as well as government. I’ve worked at the National Oceanic and Atmospheric Administration, and I’ve also spent some time in the tech industry. And so now at Schmidt Ocean Institute, it kind of brings all these pieces together for me. I’m the Data Solutions Architect. I’m basically working to help create a more standardized pipeline for the data we collect on our research vessel, and getting that data all the way out for students, scientists, anyone really. Open data to be used and made accessible.
Blake
So can you tell me a little bit about this research vessel and the kinds of data that you are cleaning and preparing?
Corinne
Sure, so the vessel that we’ve been using is the research vessel Falkor. And I’ve only been a part of Schmidt Ocean Institute for about nine months, though they have 10 years or more of data. So I’m still, as you can imagine, going through oceans of data, to get up to speed. The vessel has a variety of both oceanographic and atmospheric sensors that are always available when the ship is out, and also scientists may come aboard and bring additional sensors as well, or systems. We also have the Remotely Operated Vehicle, SuBastian, which is tethered to the vessel, but can go down 1000 meters and take video imagery, collect other samples, literally collect samples off of the ocean floor, collect water, and has sensors for getting things like temperature, salinity of the ocean, etc.
Blake
Wow. And that sounds like a lot of sophisticated machinery. So my first thought, as someone who also has to deal with data cleaning as a statistician, is, is the machinery sophisticated enough to come to you, or for the data to come to you in a pretty nice-looking form already? Or are there some common issues that come up?
Corinne
Right. I mean, whenever you’re collecting data, the sensors themselves generally have their own ways of spitting the data out. Most of these sensors have a variety of ways of putting data out, whether it’s in binary form, or just text data to files. We have some software and we’re working on building out more to interconnect these systems and make a more standardized pipeline, both to get the data out into files that are readable for both humans and computers, as well as putting them into databases and taking a step towards creating analysis-ready data.
Blake
Yeah, I worked for a time in neuroimaging, and the lab that I was a part of had joined a data-sharing consortium where all of the data that we collected and added to the database would have to be reformatted to match exactly what the guidelines for the Consortium’s data were, so that it’s more useful for other people.
Corinne
Right, and a lot of times each of these sensors may be taking data in different dimensions, and so you need to deal with- we haven’t even touched on the quality of the data itself. But just getting it onto more standardized time and distance and depth dimensions so that you can use some of the data together. But initially, you also just have to make sure that all the sensors are putting data out, and they haven’t turned off or run into other issues, or are running outside of their calibration or accuracy standards. So there are a lot of steps involved for sure to get there.
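As an illustration of the two steps Corinne describes here (putting sensors onto a shared time dimension, and checking that readings stay inside an expected range), the following is a minimal Python sketch. It is not SOI's pipeline: the function names, the sample readings, and the range limits are all hypothetical, and linear interpolation is just one simple way to align series sampled at different rates.

```python
from bisect import bisect_left


def interpolate_to_grid(times, values, grid):
    """Linearly interpolate a sensor series onto a shared time grid.

    `times` must be sorted and unique. Grid points outside the observed
    time range are returned as None rather than extrapolated.
    """
    aligned = []
    for t in grid:
        if t < times[0] or t > times[-1]:
            aligned.append(None)
            continue
        i = bisect_left(times, t)
        if times[i] == t:
            aligned.append(values[i])
        else:
            t0, t1 = times[i - 1], times[i]
            v0, v1 = values[i - 1], values[i]
            aligned.append(v0 + (v1 - v0) * (t - t0) / (t1 - t0))
    return aligned


def range_flags(values, low, high):
    """Label each reading 'ok', 'suspect' (outside [low, high]) or 'missing'."""
    flags = []
    for v in values:
        if v is None:
            flags.append("missing")
        elif low <= v <= high:
            flags.append("ok")
        else:
            flags.append("suspect")
    return flags


# Hypothetical readings: temperature every 2 s, salinity every 3 s.
temp_times, temp_values = [0, 2, 4, 6], [10.0, 10.4, 10.8, 45.0]
sal_times, sal_values = [0, 3, 6], [35.0, 35.3, 35.6]

# Flag raw readings first, so a bad value is caught before it is interpolated.
# The limits (-2 to 35 degrees C) are an assumed plausibility range.
temp_flags = range_flags(temp_values, low=-2.0, high=35.0)

# Put both sensors on one shared 1-second grid so they can be used together.
grid = list(range(0, 7))
temp_on_grid = interpolate_to_grid(temp_times, temp_values, grid)
sal_on_grid = interpolate_to_grid(sal_times, sal_values, grid)
```

Flagging before interpolating is a deliberate choice in this sketch: once a suspect reading has been blended into its neighbors, it is much harder to spot.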
Blake
I believe, if I understand correctly, you have a bit of a knack for data visualization, a bit of an interest for it.
Corinne
Yeah, I love data visualization.
Blake
I do too. And I think that there’s a really hard balance to strike between what is easy for a machine to read and communicate, what’s easy to transfer between different computers, and then what’s easy to communicate from the computer to the person on the receiving end of the information. So do you want to talk a bit about how you tackle those kinds of issues, and balancing the communication of what I imagine is very high dimensional, very precise data?
Corinne
Yeah, it’s an interesting question. I see those two as pretty separate pieces. In some ways, although there’s a lot of steps involved in making data flow through a pipeline, and standardizing it, there’s a point where you can say, okay, here is how we are going to standardize the data and make it machine readable. Or you can create an API or say that you’re going to put the data into a JSON or CSV or some kind of format, and you’re going to let other computers know that this is the format, and this is how they get the data. So although that’s challenging to do, you get to specify it. When you’re dealing with humans, even when you think you’ve created something that’s clear, you never know how people are going to interpret it. A computer, you understand how it’s going to interpret it, but people you don’t. So when it comes to visualizing for actual humans, there’s a lot of different things that go into play. I mean, obviously thinking about who is your audience? And what is their use case? Whether it’s someone who’s more technically advanced, or a student, or a policy manager? But also what do they want out of it? And then also, what are you trying to get across? Sometimes we make data visualizations to bring someone into the data to explore and not necessarily to make a point, but maybe just to view and be pulled in by the data versus telling a specific story. People oftentimes think, well, I have this data, this is the kind of graph I can make with it. But realistically you need to think about not just what you have, but what are the questions? What’s the story? Is there a story? It basically becomes a journal article in and of itself.
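The machine-readable half of that answer, deciding on a format and telling other computers what it is, can be sketched in a few lines of Python using only the standard library. This is an illustrative sketch, not SOI's actual schema or API: the field names, units, and sample records are hypothetical.

```python
import csv
import io
import json

# Hypothetical schema: the declared field list is the "contract"
# that tells other programs what each record contains.
FIELDS = ["time_utc", "depth_m", "temperature_c", "salinity_psu"]


def to_csv(records):
    """Serialize records to CSV text with a header row in FIELDS order."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


def from_csv(text):
    """Parse CSV text back into records, restoring numeric types."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({
            "time_utc": row["time_utc"],
            "depth_m": float(row["depth_m"]),
            "temperature_c": float(row["temperature_c"]),
            "salinity_psu": float(row["salinity_psu"]),
        })
    return rows


def to_json(records):
    """Serialize records to JSON, carrying the field list alongside the data."""
    return json.dumps({"fields": FIELDS, "records": records}, indent=2)


# Made-up sample records for illustration only.
records = [
    {"time_utc": "2021-06-01T00:00:00Z", "depth_m": 5.0,
     "temperature_c": 10.4, "salinity_psu": 35.1},
    {"time_utc": "2021-06-01T00:00:01Z", "depth_m": 5.5,
     "temperature_c": 10.3, "salinity_psu": 35.2},
]

csv_text = to_csv(records)
json_text = to_json(records)
```

CSV stores everything as text, so the reader has to know which columns are numbers; JSON keeps the types but is more verbose. That trade-off is part of what gets decided when the format is specified.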
Blake
Yeah. And I think one large thing, too, though, is you do want to first get people to want to get more involved, especially if it goes to a lay audience, right? To build interest in something like seafloor mapping, you first want people to want to read more about what’s happening. And I think the visualization serves a huge role in actually sparking the initial interest into reading about what is ultimately a bunch of really large spreadsheets.
Corinne
Absolutely. And I think a lot of people will underestimate the really simple pieces of beauty when it comes to data visualization. But thinking about fonts and colors, and whether or not someone is going to be colorblind, or how it’s going to affect them, is a really big step in getting someone to continue to read or look at that data visualization. I think especially in the sciences, earlier in my career, and until recently, a lot of times that wasn’t given as much weight as it should have been in terms of creating some of these visualizations. We would say, well, it’s just for science. But each of these pieces is important in how, even subconsciously, you take in the data.
Blake
Yeah, as a statistician I think the whole science comes down to, you have a bunch of data that is not interpretable to humans and you have to get it into a human digestible format. People choose different ways about what the important numbers, the important statistics are to communicate, what the good graphs are to communicate. But the end goal of science as a whole—I might be biased, because I’m a statistician—is to get that communication of what has been discovered or what has been collected by these machines or these vessels.
Corinne
Something else that’s been interesting in the last few years that’s really happened in our larger society with data visualization is that we’re starting to have a more informed audience in terms of data visualizations, especially because of COVID. Prior to COVID, you didn’t see nearly the amount of graphs and maps and things that now are taken for granted, so you don’t need to necessarily explain what the axes are and what they mean. And just having some of this general basic knowledge really changes what we’re able to do with our audience.
Blake
Yeah, the discussion of log scale on graphs at the beginning of the pandemic was a huge one.
Corinne
Right. Exactly.
Blake
Also, this is just a technical question for myself, what programs or what software do you use to generate your visualizations?
Corinne
Yeah, great question. It depends, is the answer. And new things are coming out all the time, and it’s getting a lot easier. I very much used to work a lot with just hand coding in D3, which is a JavaScript library for visualization. Even the original creator of D3 has now moved on to make it simpler to use D3. So it’s getting easier and easier. I am a Python and JavaScript programmer. So there’s a variety of tools available in Python as well, like Bokeh. But sometimes I do things simply in Google Sheets. It just depends. It depends what I’m trying to do.
Blake
So I know that you had given me a disclaimer about your involvement on the Seabed 2030 project, but if you can, would you share what you know about Seabed 2030?
Corinne
Sure. Generally, Seabed 2030 is working as a consortium I think, but with a variety of groups involved, to try and map the bathymetry of the ocean by 2030. Less than 20% of the ocean floor is currently mapped. It’s crazy to think about that when you look at how much we know about Mars. The work is really done through a combination of both institutions and academic organizations, but even individuals with vessels who might have systems on their vessel for mapping bathymetry, trying to process all of that data and make it available for a variety of uses to move science forward.
Blake
And so the data that you work with, collected from Falkor and SuBastian, is not bathymetry?
Corinne
We do have, no, we do- we are definitely part of the Seabed 2030 project. We collect bathymetric data from the RV Falkor, and it is very large data in terms of how many bytes you collect, and it needs to be cleaned and processed, which is a challenge. But we work with other partners who are very good at that, to process it and then put it out into public repositories and send it off to Seabed 2030 as part of that project as well.
Blake
And maybe a question with the silly source that I just thought of, but I imagine while all this data is being collected on the vessel, it’s probably difficult to wirelessly transmit it back to land. So is it wirelessly transmitted? Or is it that it travels by sneakernet? And somebody just walks on the boat and grabs the drive and walks it back?
Corinne
Yeah, this is specifically a problem with bathymetric data because it is very large, but really all of our data. All the data collected, some of it is small, and you could send it wirelessly potentially. Not that long ago it was very common, and still is pretty common, for the data collected on oceanographic research vessels to just be put onto a hard drive. And then once you got to shore, the hard drive is walked off the vessel and sent somewhere. We’re currently working on strategies to make that happen more livestream, especially with new satellite technologies that are coming out. But we’re definitely way in the early stages. We currently use a combination of it. Sometimes we’re able to get some data off the ship wirelessly, and other times we use a system that brings it off the boat on a type of hard drive, but then is put up directly into the cloud, as opposed to handing it to someone. It’s a little bit of everything at the moment. But the amount of data collection and the ability to connect to the cloud when you’re far off on the ocean is an issue.
Blake
I realize now how much I take cloud-based technology for granted because it’s been so long since I’ve needed to actually use a thumb drive. But I remember when people first started talking about- I was like, the cloud, where is the cloud? I don’t know if I trust it. And now it’s pretty much everywhere except for, I guess, in the middle of the ocean.
Corinne
Yeah, it’s interesting, because one of the issues now is that most data architecture systems, or companies that build things, are just trying to make it so simple. Oh, just use our cloud infrastructure. But we don’t always have access to that. And so we have to use a combination of older school methods of doing things, but also have it available to sync to the cloud when we can, or do a little bit of both. So sometimes organizations and businesses move forward quickly, but not everyone is able to take part in the systems they’re building.
Blake
Yeah, I remember hearing about the difficulty in bringing high speed internet to offshore oil rigs. Being from Houston, you know, they’re all out there, so you just hear a lot about it. But one of the things was that the people that are working on the rig, they need break time and you want them to have at least high enough internet to go FaceTime with people back home, but that was a struggle to get a network of that capability out only as far as just beyond the shelf. And even then these rigs aren’t speeding across the entire ocean. So it just leaves the problem of how difficult it really is to connect and network globally, not just overland, not just over static bodies, but even when you’re moving all the way around the globe.
Corinne
Right, exactly. And Schmidt Ocean Institute is definitely looking into new and different ways and really wants to push that ability and make it more possible, not just so that we can get data off the ship in real time, but also to allow people to be effectively on the ship that aren’t on the ship, to allow scientists or people from places that maybe wouldn’t ever have access to be able to really see and be a part of the live data streams. And it was something that I think was of interest before COVID. But again, COVID shows us why it’s so valuable to be able to do things like telepresence, and be able to have people in different places and be a part of that data.
Blake
Yeah, to actually use the global network we’ve been building for the last 30-40 years. One thing that we like to hit on in these interviews is what you think propels innovation, or how you think of new solutions to problems like, you know, mapping the ocean floor and going global with this consortium-style data effort. What you think drives these kinds of solutions to problems in people?
Corinne
Gosh, I don’t know that I have a good answer to that. I think right now, there’s a variety of different things going on in the world. But for me personally, what I’ve seen in the last, I don’t know, 10 years, is a pivot towards more of an open science, open data, open world kind of platform. Meaning that as anyone puts in all these resources to explore the ocean in these vast areas that we know actually very little about, that we’re not just saying, I’m the only one that’s gonna figure this out, but really, if one person goes and collects some data, that they understand that there’s more that can come from the value of sharing it and making it open, and allowing others to use it, especially 10 years down the line, 50 years down the line. It’s possible someone else will use that data in a way that we could have never foreseen. And so the more and more we collect, and the more and more we make open and accessible, the more unseen value we can collect from it down the road.
Blake
Yeah, I think that’s very true. It makes me think, again, I’d mentioned earlier, we’ve spent 30-40 years building up our internet architecture, the global network, and I remember hearing a long time ago that something like 60 to 70% of websites on the internet use this one little script that had been put out somewhere at the core of some CSS script a long time ago. And it’s just spread and proliferated, because they shared it. And it was useful. And now it’s everywhere, right? Which we can see the same throughout. Not just technology that we use for commerce and communication, but science and information knowledge as well. So what do you foresee? Or what do you hope to see as the direction of this kind of ocean science in the next four to five years, both through the communication and the technology?
Corinne
That’s a great question. I think that one, I’m hopeful that we’ll be able to do more as a community, to really bring the public into understanding the value of understanding the ocean, not just inherently, but for what it means for the earth and for climate, and that all these systems are highly interconnected. And I think we do that through a variety of visualization and storytelling. And also, like I said, making data and the science available in a way that it doesn’t feel like it’s held in a tall tower, a tower that is away from everyone. I think we’re finally turning in that direction. And I think it’s also really exciting to see that the up and coming students and new researchers that are coming into oceanography and Earth sciences in general, all have these mixed backgrounds, tend to have ability in computer science that never was there before. And I think we’re gonna see this huge power with the data and software and science moving forward.
Blake
All right, I had expected you to say, hoping for some data standardization to make things a little bit easier, but it’s good. I like the inspirational route, that is beautiful to hear.
Corinne
I’m a little jaded on data standardization, I think. I think it’s more likely we need to all just be aware of what our data actually is. And know that, you know what? Messy can be beautiful, too.
Blake
Yeah, you’re right, it’s actually less optimistic to look forward for the public understanding of the ocean as a whole, the opening up of science, than it is for people to just pick a standard and go for it.
.
Blake
My talk with Corinne helped me make an important connection between the sentiment about communication many data scientists share and the sentiment of openness in Corinne’s remark “open science, open data, open world.” Data science is primarily a communication problem: the communication of relevant insights from difficult-to-interpret data. I think people like Corinne and me, who enjoy being able to distill and deliver these insights, want to see them delivered far and wide because we enjoy and value sharing knowledge. It’s no wonder, then, that so many data scientists that I know are so interested in open source programs, open access data, and open access publications. And who can blame them? After hearing from Corinne just how much hard work and creativity goes into developing science like this, it’s no wonder she’ll enjoy seeing it shared by people across the world, who’ll soon be able to peer into the lowest depths of the oceans that surround us.
.
We hope you enjoyed today’s segment. Please feel free to share your thoughts over social media and visit Longitude.site for the episode transcript. Join us next time for more unique insights on Longitude Sound Bytes.