Honor Roll Lab Talk

Artie Shen on Interpreting AI, Learning New Hypotheses, and ‘Fortune Telling’

Yiqiu “Artie” Shen, a machine learning researcher who develops artificial intelligence systems for medical imaging, talks about AI’s ability to explain itself, guide discovery, and predict cancer risk.

Yiqiu Shen, PhD, is an assistant professor of radiology at NYU Grossman School of Medicine and a scientist at the Center for Advanced Imaging Innovation and Research. Dr. Shen, who goes by Artie, joined NYU Grossman in the fall of 2023 after earning a doctorate from the NYU Center for Data Science. His doctoral research—advised by Kyunghyun Cho, PhD, and Krzysztof Geras, PhD—was conducted in collaboration with investigators at NYU Langone Health and focused on the development of explainable deep learning systems for medical image analysis.

Deep learning holds remarkable promise for diagnostic radiology, but the technology’s clinical adoption is slowed by two key challenges: AI’s reliance on vast amounts of manually labeled data, which are costly to create, and the black-box nature of AI, which makes it hard for radiologists to trust a model’s findings.

To address these issues, Dr. Shen and colleagues have proposed novel deep learning models that learn without manual annotations and provide a measure of explanation for their diagnoses by highlighting the areas of a medical image that contribute the most to the model’s findings. These investigations resulted in artificial intelligence systems shown to successfully predict the risk of deterioration in patients with Covid-19 from X-rays and AIs that accurately detect breast cancer on mammograms and ultrasound.

In April, New York University distinguished Dr. Shen’s doctoral research with a 2024 outstanding dissertation award in the public health and allied health category. Our conversation was edited for clarity and concision.

What do you consider your main area of research? 

My main area of research is to develop interpretable AI systems that can help clinicians in decision-making regarding treatment, diagnosis, prognosis, and patient management. Interpretability is very important. Radiologists, surgeons, and clinicians working in different service lines say that current AI systems are either not very flexible and can only do very programmatic tasks or are flexible but do not offer an explanation that people can understand. 

Can you give an example of a clinical system powered by artificial intelligence that is very good at programmatic tasks? 

Let’s take mammogram interpretation. A mammogram is a low-dose X-ray image of the breast, where radiologists look for signs of cancer. For 10 to 20 years now, computer-aided diagnosis (CAD) systems have been using very straightforward logic to help with things like segmenting images of breast tissue and recognizing masses or calcifications. But the tricky part is whether a mass is benign or malignant, and what kind of patient care should be pursued with respect to a particular finding—those are the tasks that the CAD systems cannot do.

What about computer programs that radiologists currently work with that do have the flexibility to handle complex tasks but aren’t interpretable? Can you give an example of such a system?

There are many commercial AI models that have been integrated into radiology analytical frameworks, and these systems offer a risk score to tell a radiologist how likely a specific mass is to be an early sign of cancer. But what it means to have a score of, for example, 70 percent, is not clear. Also not clear is the rationale for why a particular finding represents a given score. That’s what I mean by calling a system flexible but not interpretable: it can propose a cancer diagnosis but gives a score without any explanation.

In your doctoral work, how did you approach interpretability, and why did you choose to do it the way you did?

Interpretability is a very high-level word and has different meanings in different applications. For me, interpretability starts on the user end. In the second year of my PhD study, I shadowed radiologists and observed their clinical workflow. They sit in front of a giant screen, exams pop up from a list, and breast radiologists have to go through that list very quickly. Most of the exams are fine, but in cases where there are signs of abnormality, the reader has to stop and look through the entire breast. Breast images have very high resolution and searching for all the signs that could indicate malignancy is very, very time consuming. So, I’m thinking: what is the best way we can convince a radiologist that this is a case that they should stop and look through?

And of course the best way to do this is to directly highlight the suspicious area. If that area is deemed benign by the radiologist, they can de-escalate the case. But if there are concerns, maybe a radiologist can draft a report according to the hint given by the AI. That’s how I envision an AI model that can contribute non-trivial value to the clinical workflow, and that’s how I developed a more concrete vision of interpretability. 

When I looked through the literature, I saw that people had already done similar things with natural images, so I thought why not try to borrow, modify, and make it work for medical imaging? 

When you say there were already similar technologies, do you mean networks that can create heatmaps or show where their attention is focused in identification tasks? 

Yes, there’s a line of work called weakly supervised object detection, going back to about 2016, that aims at building networks able to highlight objects on an image that correlate with or support the network’s prediction. People developed this feature to see whether the model is doing something sensible. 

In medical imaging, we can go beyond this and use our domain knowledge to design regularization measures that encourage the model to highlight objects that look similar to how we know cancer appears. It’s a way of increasing a model’s interpretability that isn’t usually done in natural-image classification.

Natural images are quite different from medical images, and medical images vary by modality—mammography, MRI, ultrasound, and so on. In your doctoral dissertation, you mention that a cancer finding on a mammogram can take up just 100 pixels in an image as large as 10 million pixels. That’s one thousandth of one percent of the total image area. How do you deal with something like that? What technical challenges did that present you with and how did you overcome them? 

In terms of the technical challenge, there’s of course a limit to computational resources. A GPU has a memory that allows us to process data of a certain size, and the conventional computer vision models typically handle images that are only one to ten percent the size of mammograms.

One common strategy to deal with high-resolution images is to downsample them. But if you downsample a medical image by a factor of one hundred, then suddenly a lesion becomes one pixel, and you can’t see it. To approach these issues, I proposed a neural network called “globally aware multiple instance classifier” or GMIC [pronounced gimmick]. 
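The arithmetic behind that point can be made concrete. Here is a toy NumPy illustration (my own, not from Dr. Shen’s work): a synthetic 1,000-by-1,000 image with a 10-by-10 “lesion” is average-pooled by a factor of 100 per axis, diluting the lesion to a single faint pixel.

```python
import numpy as np

# Synthetic "mammogram": 1000x1000 background with a tiny 10x10 bright "lesion".
image = np.zeros((1000, 1000))
image[500:510, 500:510] = 1.0  # lesion covers 100 of 1,000,000 pixels

def downsample(img, factor):
    """Average-pool the image by `factor` along each axis."""
    h, w = img.shape
    return img.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

small = downsample(image, 100)  # 10x10 output
# The 100 lesion pixels are averaged into one 100x100 block of 10,000 pixels,
# so the strongest remaining signal is 100/10000 = 0.01 -- effectively invisible.
print(small.max())
```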

What GMIC does is it first uses a very low-capacity module—not very precise but highly memory-efficient—to get a general impression of where in the image are the areas of interest. And then the model focuses its computational power—a high-capacity network that requires high memory consumption—on only the select areas. In my original work, those areas are six patches, each 256 by 256 pixels, a little over one percent of the area of the whole image. The idea is to fully utilize the available power without missing small details.

Was observing radiologists and learning about their process an inspiration to figuring out how the neural network should function?

Definitely talking to clinicians makes me better informed about how to approach the problem in a sensible way. But the way a neural network learns is very different from how humans are educated. Humans are usually provided with several well-annotated cases, where you have the image, the cancer, and also explanations. A mentor or an attending will teach the residents why something is suspicious and will give examples of similar cases. But neural networks—at least the networks I worked with during my PhD—do not have the power to use fine-grained guidance or annotation provided by experts. They’re trained more mechanically. 

Neural networks make up for a lower ability to generalize from few instances, and an inability to reason from explanations, by learning from a huge number of examples. In radiology, annotating vast numbers of images is prohibitively expensive—this is another limitation your work has aimed at overcoming. Can you explain what kinds of information the networks you’ve designed learn from?

Traditionally, if you want to train a model to do cancer localization, you have to provide high-quality annotations created by radiologists who go through the images to pinpoint a cancer—a process that is very time consuming. This is strongly supervised learning: you’re essentially presenting the model with a label that matches the task you want the model to do. Instead, weakly supervised models are able to learn from a binary indicator of whether a cancer is present. 

We built automatic pipelines that extract from pathology reports—which tell you whether a cancer was confirmed in a patient—the phrases that indicate the presence of cancer, and then we mapped those back to the images, so we can train a model on automatically collected labels: cancer versus no cancer. The model I built is able to utilize this weaker level of information and, by learning the common patterns shared among the images labeled as cancer-positive, it’s also able to pinpoint the cancer’s location. That’s how my work can be applied to reduce the reliance on human-provided labels.
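As a rough illustration of how such a pipeline turns free text into training labels, here is a toy keyword-matching version. The phrase lists and logic are my own invented placeholders; a production pipeline like the one described would use far more sophisticated language processing.

```python
import re

# Toy rule set: phrases taken here as indicating confirmed malignancy or its absence.
POSITIVE = [r"invasive (ductal|lobular) carcinoma", r"\bmalignan\w+", r"\bDCIS\b"]
NEGATIVE = [r"\bbenign\b", r"no evidence of malignancy"]

def weak_label(report: str) -> int:
    """Map a free-text pathology report to a binary image-level label:
    1 = cancer confirmed, 0 = no cancer. No pixel-level annotation is needed."""
    # Check negations first so "no evidence of malignancy" is not read as positive.
    if any(re.search(p, report, re.IGNORECASE) for p in NEGATIVE):
        return 0
    return int(any(re.search(p, report, re.IGNORECASE) for p in POSITIVE))

print(weak_label("Findings consistent with invasive ductal carcinoma."))  # 1
print(weak_label("Fibroadenoma, benign."))                                # 0
```

The key point is that the label is image-level (“cancer somewhere in this exam”), yet a weakly supervised model trained on such labels can still learn to localize the cancer.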

Can you talk about multimodality and multimodal networks? 

Multimodality means trying to utilize text, imaging, categorical variables, and other data that have different formats, different representations, and that embody different kinds of information. One of the things I’m undertaking now is building a model to efficiently use complementary information from all kinds of different sources to try to tell whether a healthy patient—a patient who is currently cancer free—will develop cancer in, say, a five- or ten-year period. This is a very difficult task for human experts because the patients of interest technically do not have cancer and radiologists are trained to detect—not to predict—cancer. And if you consider a longer period, for example 20 years, then this is almost like fortune telling.

Your and your colleagues’ work has shown that breast cancer findings made by deep learning models, including models that outperform trained radiologists, become even more accurate when combined with the diagnoses made by human readers.

Yes, I built a deep learning model for breast ultrasound, and we showed that if the predictions of this breast ultrasound AI model can be combined with human radiologists’ diagnoses, we are able to reduce false positive findings by about 36 percent while not missing additional cancer.
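One simple way to form such a hybrid—shown here purely as an illustration, not as the combination rule used in the published study—is a weighted blend of the AI’s malignancy probability with the reader’s estimated probability:

```python
def hybrid_score(ai_prob: float, reader_prob: float, weight: float = 0.5) -> float:
    """Blend an AI malignancy probability with a radiologist's estimate.
    `weight` sets the AI's share of the combined decision."""
    return weight * ai_prob + (1.0 - weight) * reader_prob

# A reader leaning toward biopsy (0.6) and a reassuring AI (0.1) yield a lower
# combined suspicion (~0.35), potentially averting a false-positive workup.
print(hybrid_score(0.1, 0.6))
```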

In that work, the hybrid of human and model findings is a theoretical construct. Have you done any studies in which radiologists are presented with model information in real time to consult it alongside the images they’re reading?

This is exactly what we’re planning to do with the multimodal AI system: first, show the reader the image and let the reader make a diagnosis; then, show the AI prediction along with supporting evidence; and then, give the reader a chance to modify the diagnosis. Our hope is that through this study we can realistically assess the impact of AI that’s actually cooperating with the radiologist.

As you said earlier, experts tend to distrust cancer risk metrics served without an explanation. For radiology, an area of medicine that mostly deals with qualitative evaluation of images, things like heat maps and highlighted areas of interest seem to be well within the logic of the discipline. But what about verbal reasoning? We now have large language models (LLMs) and generative AIs able to create images based on text prompts. The kinds of models built for cancer detection are very different—not generative—but what is your thinking about the potential for language models to create verbal reasoning about the findings or the inner workings of diagnostic models?

I think this is exactly the next step of interpretability research in general, especially in the medical domain. There’s some evidence that if you train an LLM well, it’s able to reason using medical concepts. When trained using medical data, LLMs are able to achieve a higher-than-passing score on a qualifying exam for MD candidates. Another promising fact is that LLMs are becoming better at following instructions, meaning that you can ask for explanations and LLMs will provide them. Potentially—with certain communication mechanisms between a diagnostic AI and an LLM—we may be able to utilize an AI model’s capacity to find cancer and an LLM’s capacity to interpret the intermediate representations generated by the diagnosis model and turn those intermediate representations into words. 

Of course, the biggest difficulty here is that LLMs tend to hallucinate—they tend to make up things that sound very reasonable. So, what if an AI model’s diagnosis is wrong but the LLM somehow comes up with very convincing illusional evidence or explanation? That’s very dangerous, so the tendency to hallucinate should be addressed before we can proceed—that’s one thing. Another thing is that whenever we’re talking about AI, the limitation is always the data, and currently we do not have that many datasets with both the diagnosis and the rationale. A radiology report is actually very different from an explanation. The report can describe a concerning finding but doesn’t tell you why the observation could be correlated with a cancer. So, automatically generating high-quality and reliable explanations is a difficulty that we’ll have to overcome in building verbally explainable models for medical image analysis. 

You did your PhD at NYU Center for Data Science in the medical track. There were a number of alternative tracks you could have pursued. What led you down the path of AI for medicine? 

I joined the Center for Data Science in 2016 as a master’s student. Before that, I worked at a finance company as a software engineer—that was intellectually challenging but I felt it lacked social impact. I decided to pursue graduate education in order to have more ability to make such an impact. In my application statement I wrote that I wanted to do machine learning to improve some aspect of society. I didn’t have a very clear vision then, but after two years at NYU, I wanted to go with medicine because NYU has a very good medical school, so I felt like it was natural to use this advantage. 

Another, more personal motivation is that my mom is a doctor in China—she’s in the hematology department, helping people with blood diseases, so I grew up in an environment where I was educated to think about the impact of my work on others and developed a curiosity about the human body and natural science. I feel that AI for medicine can fulfill both of my, say, requirements for a good research direction: the first is social impact; the second is to satisfy my curiosity. 

In natural science, we’re decoding complicated laws set by mother nature, but we can only come up with and test a limited set of hypotheses. With AI and the data we have accumulated, we are able to learn more complicated rules, and if we are also able to let the AI explain those rules, it can help us improve our understanding of nature. 

You’re talking about a role for AI that is much bigger than just helping human experts perform a task better or faster or more efficiently, which is what a lot of AI research—including yours—is focused on. Can you talk more about what AI could mean for discovery and why weakly supervised and unsupervised methods are important to learning new things?

Back in the 17th century, you’d have clever physicists who would collect some data and come up with an equation that describes this data. But the equations that we can come up with are somewhat limited. Deep learning models are able to automatically learn rules from patterns, however complicated. That’s why AI is a promising tool, because it is able to increase the complexity of the theory we can build on. But the missing piece here is that currently the flexible models are less interpretable, so we do not know what they’re thinking. 

Technologies like weakly supervised learning, unsupervised learning, and self-supervised learning have one thing in common: they give AI the freedom to learn without strict guidance from humans. And so, what we need to do is provide AI with more objective facts, for example: this is an image, this is a list of symptoms. But the connection between the two could be up to AI to learn. That’s something I’m very interested in, and there are many, many applications. 

As you embark on a new role as assistant professor, what are some of the things you’re looking forward to?

My high-level goal as faculty is the same as in my PhD training: to produce high-impact work and to improve our clinical workflow. I also want to expand my research to other realms such as medical image reconstruction, prognosis, treatment planning, algorithmic fairness—all those topics are very interesting to me and we have the resources at NYU Langone to explore them. 

I feel a great duty to pass on my knowledge to a new generation of trainees and students, and being a mentor is also about helping people grow in a positive way, both in their careers and in life—that’s another objective I have.
