By Francie Fink
Dr. Marynia Kolak is an assistant professor and researcher at the University of Illinois Urbana-Champaign. She is Principal Investigator for the Health Regions & Policies Lab. When Kolak describes her work, she paints a picture of how geography, data science, and health are deeply intertwined, informing each other. As a health geographer and spatial data scientist, Kolak is committed to understanding how where we live and work shapes our health outcomes—a pursuit that has led her to develop truly innovative tools and insightful publications that bridge gaps in policy, research, and community advocacy.
You’re involved in lots of things on campus. Could you elaborate on your main priorities, for your research and other activities?
I’m a health geographer / spatial data scientist. I use data science techniques, methods, and approaches to solve problems related to health and where we live. We know that, just based on where you live, neighborhood structures and demographic trends can explain over half of the variation in many different health outcomes, which is pretty wild. For example, without any genetic data or access to clinical records, our team found that we can predict over 60% of premature mortality in Chicago. There’s a lot of complexity to why that is.
Way back, when sanitation and infrastructure looked very different, there was an even stronger relationship between where you lived and your health outcomes. As medicine advanced and we were able to provide better sanitation, food quality, and air quality, standards rose and health got better. But now, we’re actually seeing a trend back to that in a lot of ways. A lot of scholars are finding that there is a widening gap in health inequities. The pandemic kind of exacerbated that.

Now, it’s more about who has access to resources: high-quality health insurance, physicians nearby, but also things like public transportation and job opportunities that connect people to resources other folks might not have. Historical legacies of discrimination, like redlining and racialized real estate covenants, tend to persist despite legal changes, and their effects on outcomes remain visible many years later. My job as a data scientist is to use health geography to frame conceptual questions and theoretical frameworks to understand these complex systems. I try to understand how all of these things fit together and what might be driving how outcomes differ for different people.
I also work a lot on the opioid epidemic. I use health geography to frame what’s happening socially, but then I’ll use data science to really investigate. I’ll spend a lot of time on questions like: how do I build a really good access measure, one that captures not just how far away something is but whether someone can actually reach it? I obsess over access a bit. How do we put all of these variables together and actually model the environment?
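Access measures in health geography can get quite sophisticated (accounting for supply, demand, and travel time, as in floating-catchment methods). As a minimal sketch of the underlying idea, the example below computes the simplest possible access measure: great-circle distance from a neighborhood centroid to the nearest facility. The coordinates and clinic locations are hypothetical, invented for illustration.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def nearest_distance_access(tract_centroid, facilities):
    """Simplest access measure: distance (km) from a tract centroid
    to the nearest facility. Lower values mean better spatial access."""
    lat, lon = tract_centroid
    return min(haversine_km(lat, lon, flat, flon) for flat, flon in facilities)

# Hypothetical tract centroid and clinic locations (roughly Chicago-area coordinates)
tract = (41.88, -87.63)
clinics = [(41.90, -87.65), (41.75, -87.60), (42.00, -87.70)]
print(round(nearest_distance_access(tract, clinics), 1))  # km to closest clinic
```

Real access measures would typically replace straight-line distance with travel time on a road or transit network, and weight facilities by capacity; this sketch only shows the geometric core.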
I come from a stricter tradition: if you just throw in hundreds of variables, it’s “junk in, junk out.” We want to bring the science to data science and think more critically about which variables go in. We’re capturing distinct phenomena that we’re trying to test against theories, so that when we do get results we can actually make sense of them and further the theory. If I just cared about prediction, none of that would matter. But because I’m trying to advance the theory of health in place, it adds all these different limitations, which I love. It becomes more of a puzzle, versus just trying to optimize results. Then, I use different machine learning approaches to create new measures that explain distinct dimensions of the environment.
So what is your general approach to determining how these factors contribute to health inequalities?
Well, you might put in a bunch of variables and out of that, get five distinct dimensions or components that reveal different things about the issue. I’ll work on interpreting those, and then use those indices with health outcomes to see how they contribute to health inequalities. But then I’ll add real-world complexities. If I’m looking at the opioid epidemic, for instance, a lot of data we need doesn’t exist at the scale we need it. Like, how do we measure “stigma”? I explore different variables, create measures at the neighborhood level, and work with colleagues doing qualitative interviews at the individual level. Then we’ll see how well we did, validating against lived experiences.
For other work, like environmental injustices, the relationships can be more complex and evolve over time. In these cases, it may be important to integrate diverse datasets and make them accessible, often through web applications like ChiVes, which we co-developed with community partners. We’re bringing all this data together and working with community groups to figure out: Are we totally off? How can this data be useful for advocacy?
Sometimes, the underlying inequities, like segregation in Chicago, play a bigger role in health disparities than the environmental factors themselves. The 1995 Chicago heat wave, for instance, resulted in preventable deaths largely due to social isolation and lack of access to air conditioning. So the causal pathway we’re trying to address can be complicated. The other part is timing. We don’t want to wait 50 years to prove without a doubt that this impacts that. A lot of groups need data now to make decisions.
Are there other scales outside of neighborhoods that you work on for your research—and why might that scale matter?
Are you a GIScientist? Because this is what we’d call cross-scalar dynamics: what happens at different spatial scales, and when two scales interact, what new things emerge? I’ve worked a lot at the neighborhood level, which might be roughly equivalent to a Census tract, but I’m also really interested in how that interacts with individual experiences. That’s where I collaborate with colleagues doing qualitative work at the individual level.
Policy, on the other hand, tends to happen at coarser resolutions—county or state levels. For example, how does a state opting into Medicaid at the state level influence health outcomes? During COVID, we saw this play out in real-time: states with opposite policies side by side, like California and Arizona. People crossed state lines for services, and we even observed hotspots emerging along highways connecting those areas. So, rather than starting with a fixed scale, I begin with the specific phenomenon I’m investigating and then figure out where the variation might come from. Is it policy-driven? Shaped by local transit systems? Some other factor? That’s part of the puzzle you have to put together.
“We don’t want to wait 50 years to prove without a doubt that this impacts that. A lot of groups need data now to make decisions.”
You mentioned the web platform ChiVes you worked on for Chicago communities. For that tool, how does your research inform policy decisions and policymakers?
We’ve seen those policy connections in a lot of different ways over time. It’s been interesting because our task is literally to help bridge gaps in the policy space. For example, one of the early prototypes that informed the ChiVes web application was developed for the Chicago Department of Public Health, and it was actually used to help guide tree plantings to combat climate change. That project helped unlock a major budget for tree plantings, and the tool still exists within the City of Chicago. And really, all these things I’m talking about take huge teams, so I’m really a manager and team lead from that aspect.
Can you talk a little more about the range of applications for the ChiVes tool and other similar applications you’ve developed?
I know that it’s been used for teaching in different Chicago schools at the undergraduate and graduate levels to help inform the next generation of policymakers. We know at least one or two community groups are using it to help build their advocacy plans. One of our goals was to make it easy to provide stats, compelling visualizations, and all of that. Stories are essential for making policy, but so is data. One of the applications most compelling to me actually came from a collaborator in Wisconsin whose family lived in the Mississippi Delta. There’s a lot of concern about access to high-quality health care in the South, especially because so many hospitals have closed over the past few decades. So he used the U.S. COVID Atlas and worked with the local NAACP to present how communities there are being affected in disproportionate ways.
We’re also active in the policy space for different medication access issues. Making medications that are crucial to saving lives accessible is essential. One of the last papers we worked on focused on a bill in Congress that aimed to expand access to methadone, one of the most effective medications for preventing overdose deaths in the long term. That research was ultimately published in Health Affairs Scholar and played a role in policy advocacy, including discussions in congressional hearings. In that case, an interactive web application wasn’t as critical as having clear, well-structured visualizations that conveyed specific, direct information, along with associated statistics presented in a table format. The target audience for that work had different needs. There is no single template for effective communication, and there’s a need for scientists in this space who can communicate in diverse ways. Maybe it’s my MFA: I have that interest in the art of communication, to really try and empathize with and understand who your different audiences are.
What do you see as the biggest opportunities and challenges in making data more accessible, particularly with the types of datasets you work with? We’ve touched on some of these issues, but for those who may not be as familiar with larger datasets, navigating federal or census data can often be overwhelming. What are some of the barriers to access, and how do you think researchers can work to overcome them?
I would say I have a very different response now than I would have in the past. A lot of data is now available for download on the Internet. It can be extremely difficult to actually access and work with, though: making data available is different from making it truly accessible. If you think about access as a multidimensional thing, you need to understand the jargon. You need to understand the data formats.
I work with spatial data, and that’s not just CSV files. If you’re working with shapefiles or raster data, those formats are super specialized to specific groups. The accessibility of data, and of the research about it, is pretty far from where I want it to be.
I think the lack of human-centered design is a big part of this, because it’s generally just folks dropping data or making code available, and that’s it. To actually make it meaningful to a variety of different audiences, you have to understand who your audiences are, right? Who’s going to be accessing this information? What are their goals? What are they searching for? So a big part of our lab is trying to look at accessibility from that perspective. We get funding to build toolkits that show people: hey, if you’re interested in this topic, what are some of the research questions and hypotheses you might have? That leads you to the data you actually need. Then, you have to uncover what types of new expertise you need in order to work with it.
So it’s definitely possible, but the existing space is so unnecessarily intimidating. People think you have to suffer through it, or you don’t “earn it.” I find that actually pushes out so many people who are really thoughtful and really brilliant. When I was in high school, for instance, I loved math, but my teacher told me I wasn’t good at it and wouldn’t let me enroll in Calculus. And now I teach spatial statistics and publish on it! So again, there are so many really interesting perspectives and lived experiences that we need in order to create better theories and use data in different ways.
“To actually make it meaningful to a variety of different audiences, you have to understand who your audiences are. Who’s going to be accessing this information? What are their goals? What are they searching for?”
Can you share more about how your team approaches improving the searchability and user-friendliness of complex datasets, then?
We’ve found that making the process of working with data—developing web applications and analyses—yields really interesting results. We’ve been working on a toolkit focused on data about social determinants of health (SDOH) in place, and we’ve had a range of users engage with it. Nurses who have gone through the fellows program, community advocates, and even tenured professors have all worked with the toolkit on developing their own web application with SDOH data. Everyone finds some aspects easy and others difficult, but that makes the process worthwhile.
And then the product on the other end has to be really good, right? Sometimes, data scientists create applications that aren’t very user-friendly because they don’t consider art, design, or communication. So, it’s not just about making the data accessible but ensuring the entire process of working with it is intuitive.
One of the major challenges is that so much data is siloed within specialized fields, and people may not even know what to search for. To find something, you need the right search terms—but how do you know what those are? So with one of our new projects, the SDOH in Place Data Discovery Tool, we’ve gathered all this metadata—essentially, data about the data—to create a more intuitive search experience, similar to how Google searching works. We’re also building customized taxonomies to search for different topics.
For example, if you’re looking for data on parks and green spaces, you might not think to search for the Normalized Difference Vegetation Index (NDVI), a common satellite measure researchers use for that purpose. So we’re extending search capabilities and incorporating AI in ethical ways to encourage folks to enjoy that discovery process a bit more and actually bring different data together.
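To make the jargon gap concrete: NDVI is just a normalized ratio of two satellite reflectance bands, defined as (NIR − Red) / (NIR + Red). The sketch below computes it for two hypothetical pixels; the reflectance values are invented for illustration.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index for one pixel:
    (NIR - Red) / (NIR + Red), ranging from -1 to 1.

    Dense vegetation reflects strongly in the near-infrared band,
    so values near 1 indicate green space; bare soil, pavement,
    and water score near zero or below."""
    if nir + red == 0:
        return 0.0  # guard against division by zero on no-signal pixels
    return (nir - red) / (nir + red)

# Hypothetical surface-reflectance values for two pixels
print(ndvi(0.50, 0.08))  # vegetated pixel: high NDVI
print(ndvi(0.12, 0.10))  # built-up pixel: near zero
```

A researcher fluent in remote sensing knows to search for “NDVI”; a community advocate looking for “parks” does not, which is exactly the translation gap the Data Discovery Tool is meant to close.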
You mentioned the ethical use of AI. What do you think are some trends or emerging technologies that are going to have the biggest impact on geospatial analysis and data science in the coming decade? Are there things you’re personally excited about?
There are almost two branches of responses to this. So, the “correct answer” in GIS right now, is that everyone’s talking about GeoAI—how to leverage AI to extract more information from spatial data and uncover new insights. A lot of tasks that used to be manual can now be automated and done faster, and this has already made a big impact in geospatial science.
Personally, though, I’m really excited about a different segment: how AI can help us improve the actual theories about what we know. This includes things like knowledge graphs, ontologies, and semantics, what some might call the “boring” parts of AI. There’s this need for more perspectives and voices, and I think there are a lot of interesting options for AI here. Instead of looking only at the data you get at the end of a project, what can we do to connect the knowledge we already have?
For example, in social determinants of health research, we could input our existing knowledge into AI systems to see if unexpected patterns or relationships emerge. This is using it more as an exploratory or hypothesis generation tool, right? And it can also be used to help, if folks have different ways of asking the same thing that we might not be able to predict—by recognizing how different disciplines frame similar concepts and synthesizing that information effectively. It’s not as glamorous, for sure, but it’s proving to be a pretty exciting process.
You highlighted the contrast between proprietary GIS systems and open-source, web-based approaches in geospatial research. What opportunities or challenges does this create, working with big datasets?
There’s a really interesting divide in the GIS world. A lot of research still happens within proprietary geographic systems, but there are also a lot of developers outside of GIS doing really exciting work. They’re using web-based tools, working directly with SQL databases, rendering spatial data in new ways with technologies like WebGL and KeplerGL, all that kind of stuff.
These two approaches rely on different file formats, which has led to some big differences in innovation. Proprietary systems tend to be more closed, so development moves slower, whereas the open-source space has introduced formats like Parquet and Apache Arrow, which allow for massive amounts of data. That completely transforms what you can do.
For example, when we built the U.S. COVID Atlas, our system loaded data directly into the web browser and converted it on the fly—eliminating the need for a traditional server backend. One of my colleagues was already working with that kind of setup, so he was able to spin up the first development version in just two weeks—something that would have taken months in a classic environment. I think that working with new types of data formats and moving to the open source landscape is totally changing what’s possible, which is really exciting.
Another exciting shift is, again, integrating AI into spatial data tools. For instance, that same colleague who spun up the serverless COVID Atlas recently added an AI-driven component to KeplerGL to help guide users through data exploration, making suggestions for analysis so that you don’t have to read a thousand books before getting started. You can get your feet wet right off the bat and go from there.
While a lot of focus is on big data, I think there’s also a huge opportunity in what I call “small data.” You might not have as many records, but the metadata—the knowledge and information about each data set—is massive. So you can mine that information about the data, and that can lead to new insights. It’s a different way of thinking about it, but I think it’s exciting.
Learn More and Get Involved
If you’d like to learn more about Kolak’s research, follow her on Google Scholar. Contact Dr. Kolak here.
Contact the Office of Data Science Research if you’re aware of other people or resources we could profile here. ODSR is a campuswide convening organization that facilitates collaborations, resource sharing, and public engagement focused on data science research activities at the University of Illinois.