Young scientist helps design software that measures a surgeon’s skill

Amy Jin, 18, is a recent high school graduate who loves hip-hop dance, the violin and English literature. But it’s her passion for computer science that has made her a superstar in the exploding field of artificial intelligence.

Jin has been intrigued by AI since the sixth grade, when students at The Harker School in San Jose, California, chose research projects that challenged them to show how they’d use computer programs to tackle real-world problems.

But her passion for the subject was ignited when she was a high school freshman and she heard an IBM scientist describe how the Watson supercomputer could help extend human capabilities in medicine and other fields through artificial intelligence, the ability to teach machines to “think” and “see.”

“That was really fascinating to me — that Watson could become like a second pair of eyes for a doctor,” the soft-spoken teen said in a recent interview. “I thought artificial intelligence was a really promising field, with so many cross-disciplinary connections.”

Since then she’s become part of a new generation of young science enthusiasts who are making waves in artificial intelligence, one of the hottest fields today in computer programming. The same technology behind the self-driving car, AI has the potential to change medical practice in myriad ways, from helping diagnose disease early to improving treatment and ensuring patient safety in the hospital and at home.

Over the past two years, Jin has worked with mentors at Stanford to produce a new software program that can measure a surgeon’s technical skill. It works by “watching” a video of a surgery and tracking the movement and timing of instruments used during a procedure.

The technology, created by Jin and researchers from Stanford’s medical and engineering schools, won the top research prize at a major international scientific symposium on artificial intelligence, where Jin presented it in December.

Arnold Milstein, MD, PhD, director of Stanford Medicine’s Clinical Excellence Research Center, predicts the approach will break new ground in objectively assessing clinicians’ manual skills in diverse clinical activities.

“This could make a big difference when manual skills matter,” said Milstein, a co-author of a paper describing the work. “It provides a path for tailoring the duration of surgical training to how quickly residents learn. And it opens the way to a more objective approach to periodically certifying a surgeon’s technical skill or alerting a surgeon when he or she needs a restorative break during a long procedure.”

The project grew out of a 6-year-old partnership between Milstein’s group and researchers at the Stanford Artificial Intelligence Lab, led by Fei-Fei Li, PhD, a professor of computer science at Stanford. The scientists are developing forms of AI to help ensure that best practices in health care are reliably applied. They initially focused on increasing staff adherence to patient safety protocols in intensive care units, improving hand hygiene in hospitals and monitoring frail seniors at home by assessing such things as how steady they are on their feet.

“Then one of the CERC fellows said, ‘I think we should do this with surgical skills,’” said Milstein. “The American Board of Surgery has long sought an objective test of surgeons’ technical skills.”

Serena Yeung, left, and Jeff Jopling worked with Amy Jin when she was a high school student to design software to assess the skills of surgeons. (Timothy Archibald photography)

Those hands-on skills are critical, said Stanford general surgery resident Jeff Jopling, MD, the former CERC scholar who proposed tracking surgical skills with computer technology. Jopling naturally gravitated to the project, as he had done graduate studies in Georgia in both engineering and medicine. He came to Stanford six years ago because he wanted to work with Milstein and Tom Krummel, MD, then chair of the Department of Surgery, to improve health care systems around the country.

Safety issues in health care became a focal point after the National Academy of Medicine issued its 1999 report on the high rate of deaths and disability that resulted from human errors in medicine. Afterward, clinicians tried to minimize preventable complications with solutions such as surgical safety checklists, a series of detailed steps for clinicians to follow to help avoid mistakes, Jopling said.

Then a 2013 study of 20 bariatric surgeons in Michigan highlighted a missing variable in the picture: surgeon proficiency. The study, published in The New England Journal of Medicine, showed that if a surgeon did well — as measured through blind ratings by peers of videos of surgeons’ hand movements — so did the patient; if the surgeon faltered, the patient was more likely to suffer complications, undergo repeat operations and have emergency room visits.

“Until then, there had been so much focus on improving the system, but here it showed that people and their skills matter, too,” said Jopling, one of the authors of the latest AI paper with Jin.

Yet, in the course of their training, surgeons are sometimes unable to get a good sense of how they are performing, he said.

“Even when I do the 1,000 operations for my training, I get very little feedback on most of those surgeries,” Jopling said. “I was surprised by that as a trainee. I thought it would be like a sport or music, where you have a coach saying, ‘Do this. Don’t do this.’ Exceptional teachers provide that, but not everyone does. Not everyone can explain what you are doing well or not doing well.”

While Jopling was mulling the new surgery project, Amy Jin was busy adjusting to the demands of high school. The second child of Chinese immigrants, both PhDs in physics, she had long been keen on computer science and was already a whiz at math, but she had never done any programming.

So as a freshman she signed up for an AP computer science class and joined the school’s Women in Science, Technology, Engineering and Math Club (she later became the club’s president). There, she heard about an opportunity at the Stanford Artificial Intelligence Lab’s Outreach Summer Program, which is designed to entice young women into science careers.

In the program, she was paired with Serena Yeung, PhD, then an up-and-coming doctoral student, who mentored her. Yeung is also the daughter of Chinese immigrants, and the two shared a passion for science and a desire to help others. Yeung had long been interested in medicine — her father is a family physician — but as a Stanford undergrad she realized she was an engineer at heart. She became immersed in AI, doing internships in the field at Facebook and Google. While searching for a doctoral project, she met Milstein and became captivated by the idea of using the technology to improve medical practice.

Yeung introduced Jin to one of the group’s AI in medicine projects — a hand-sanitizing initiative designed to control the spread of infection, a significant problem among hospitalized patients. For the project, Yeung, Jopling and colleagues at CERC, the Department of Pediatrics and the AI lab received permission to install depth and thermal sensors outside a transplant unit at Lucile Packard Children’s Hospital Stanford, where hand hygiene dispensers are located. They used AI to program the sensors to monitor personnel — shown only as outlines of human shapes to protect their privacy — as they passed by the dispensers.

Their algorithm was able to predict with more than 95 percent accuracy whether staff members were using proper hand hygiene, the researchers reported at the Machine Learning for Healthcare Conference in 2017. They are now using the algorithm to measure hand-hygiene compliance in other hospitals and see whether interventions, such as real-time alerts, can improve these practices, said Yeung, who is expected to join the Stanford faculty in early 2019.

Jin was enthralled by her work on the project and was eager to learn and do more. Yeung figured the budding surgery project was the perfect new opportunity for her.

“We could scope it to a level that Amy could start with. Obviously, she surpassed all of our expectations,” Yeung laughed. “It became much more than a high school project, which was great.”

Jin fit in the work between a demanding school schedule, club meetings, and orchestra and dance rehearsals. She audited a Stanford undergraduate course in computer vision to learn more about how to train computers to “see” and understand the visual world, with Yeung coaching her through. On her own, Jin dug up dozens of related studies in the medical and computer science literature, which she shared with the team.

Jopling took her under his wing to introduce her to the world of surgery. He showed her laparoscopic surgical techniques in the Goodman Surgical Education Center at Stanford Hospital.

The trio met every other week, and sometimes more often, at the Stanford Artificial Intelligence Lab across the road from the university’s medical center. The glass-walled laboratory is a hive of activity, as dozens of hoodie-wearing students peer intently at screens displaying colorful computer code and then discuss problems, often well into the night. The three researchers also frequently texted and emailed each other, as Jin was dependent on her mother to drive her to meetings.

The challenge of the project, which was officially launched in the summer of 2016, was to “teach” the computer to recognize and follow the path of surgical tools as the clinician guided them through the body. This is a form of object detection, a field that has been rapidly advancing in recent years, in part because of contributions from Li’s lab.

Identifying data points

The method involves developing an algorithm that teaches the computer to learn as it is fed thousands of data points. With each bit of data, the computer gradually adjusts until it reaches a stage where it can form an accurate picture of the object — in this case, a surgical tool. The process is enabled by the growing ability of computers to rapidly digest vast amounts of data. Jin refined some of the techniques of object detection to apply it to surgery, Yeung said.

“The general idea was that if we are able to track and recognize instruments in videos, we would be better able to analyze tool usage patterns and movements,” Jin said, adding that such analysis has been shown to be an effective building block for measuring and assessing a surgeon’s skill.

For simplicity, the researchers focused on gallbladder-removal surgery because it is a common, standardized procedure that typically uses seven instruments at most, including clippers, graspers and scissors. They obtained 15 videos of procedures done at the University Hospital of Strasbourg and labeled some 2,500 individual frames, attaching a value to each one so the computer could build a visual picture of the tools and locate them within the surgical field.

They used metrics to track the timing of tools — which instrument was used when, and for how long — and produced maps of the pathway of each tool. In addition, they created heat maps that showed how far the tools ranged within the surgical field, as better surgeons tend to handle instruments in a focused area.
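
As a rough illustration of the timing metrics and heat maps described above, the following Python sketch assumes a detector has already produced, for each analyzed frame, a tool name and a normalized position. The tuple layout, the function names usage_seconds and heat_map, the frame-rate parameter and the grid size are all assumptions made for this example; it is not the researchers' code.

```python
from collections import defaultdict

# Hypothetical input: one tuple per detection, (frame_index, tool_name, x, y),
# where x and y are the center of the detected tool, normalized to [0, 1].


def usage_seconds(detections, fps):
    """Return the total time each tool was visible, in seconds.

    fps is the assumed number of analyzed frames per second of video.
    """
    frames_by_tool = defaultdict(set)
    for frame, tool, _x, _y in detections:
        frames_by_tool[tool].add(frame)
    return {tool: len(frames) / fps for tool, frames in frames_by_tool.items()}


def heat_map(detections, tool, bins=4):
    """Count how often one tool appears in each cell of a bins x bins grid."""
    grid = [[0] * bins for _ in range(bins)]
    for _frame, name, x, y in detections:
        if name != tool:
            continue
        # Clamp so that a coordinate of exactly 1.0 lands in the last cell.
        col = min(int(x * bins), bins - 1)
        row = min(int(y * bins), bins - 1)
        grid[row][col] += 1
    return grid


# Made-up detections, for illustration only.
detections = [
    (0, "grasper", 0.10, 0.10),
    (1, "grasper", 0.12, 0.11),
    (2, "grasper", 0.60, 0.55),
    (2, "clipper", 0.80, 0.80),
    (3, "clipper", 0.82, 0.81),
]

print(usage_seconds(detections, fps=1))
print(heat_map(detections, "grasper", bins=2))
```

A grid in which most of the counts fall into a few neighboring cells would correspond to the focused handling the article associates with better surgeons. How tightly clustered counts as focused is left open here.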

“With that, we could gain a sense of a surgeon’s performance overall,” Jin said over iced tea at a Starbucks near her home.

From the visuals and statistics, the researchers were able to gauge multiple aspects of the clinicians’ performance, including their economy of motion, how often they switched back and forth between instruments, and their efficiency at each step of the procedure. They then asked three Stanford surgeons to watch the videos independently and rate the surgeons on a scale of 1 to 5, based on widely accepted criteria: their efficiency, their dexterity with both hands, their depth perception and their handling of the tissue.
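
Two of the measures mentioned above, economy of motion and instrument switching, can be approximated with very simple arithmetic once tool positions are known. The Python sketch below is a minimal illustration under stated assumptions: the function names path_length and count_switches are hypothetical, positions are assumed to be per-frame (x, y) centers, and total distance traveled is used as a stand-in for economy of motion. The article does not say how the researchers computed these quantities.

```python
import math


def path_length(points):
    """Total distance traveled by a tool across consecutive (x, y) positions.

    A shorter path for the same task is treated here as better economy
    of motion; that reading is an assumption of this sketch.
    """
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


def count_switches(active_tools):
    """Number of times the active instrument changes between frames.

    active_tools is an assumed per-frame sequence of instrument names.
    """
    return sum(1 for a, b in zip(active_tools, active_tools[1:]) if a != b)


# Made-up data, for illustration only.
grasper_path = [(0.0, 0.0), (3.0, 4.0), (3.0, 4.0)]
sequence = ["grasper", "grasper", "clipper", "grasper"]

print(path_length(grasper_path))
print(count_switches(sequence))
```

Numbers like these would only be comparable across videos of the same procedure step, since a harder case may require more movement regardless of skill.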

“The insights into how the machine rated the different surgeries correlated with the surgeons’ insights into how they rated the videos,” Yeung said.

For instance, there is a critical step in a gallbladder-removal surgery where the clinician has to clip and cut both the cystic artery, which supplies blood to the organ, and the cystic duct, which carries bile in and out of it. When done properly, this step prevents bleeding and leakage of bile during and after the procedure. If the clips are in the wrong place or come loose, the patient can suffer devastating complications, including damage to the bile duct.

A good surgeon does this efficiently, with economy of motion. In one case, a videotaped procedure showed the deft skill of the surgeon, with the clipper and grasper placed just right. Another video showed a surgeon struggling to put an extra clip in place, then later taking some time to pry it loose. The computer detected the discrepancy in skill levels by viewing not only the placement and the pathway of the implements, but also the elapsed time of the procedure.

With the analysis in hand, the group submitted their results to the Workshop on Machine Learning for Health, part of the conference on Neural Information Processing Systems in December 2017 in Long Beach, California. The conference is one of the biggest AI meetings in the world, involving 7,000 researchers, graduate students and industry professionals. Yeung listed Jin as the first author of the paper, an extraordinarily generous move on her part considering doctoral candidates eager for publication credits typically claim this spot themselves, Milstein said.

In the workshop, the paper was selected from more than 120 submissions as one of 10 worthy of a spotlight talk. Jin, attending her first-ever conference, presented her work to the distinguished audience, and it was then published in the conference proceedings.

When the choice of best paper was announced, Jin was casually scrolling through her laptop, barely listening since she didn’t expect she’d know the authors.

She was stunned when her name was called. “I was just kind of half there and was really surprised,” she said, her face lighting up with a smile. She immediately sent a text message to Yeung and Jopling, who said it was a surreal moment.

Jopling called Jin an “inspiration to all of us,” and Yeung marveled that a high school student had both submitted a paper to the conference and won the top award.

So how did Jin do it? “It’s definitely a combination of luck and opportunity, I guess,” she said. And hard work? “Hard work, yeah, from everyone,” she laughed.

Refining the tools

Jopling said the next step in the project is to amass as many as 1,000 videos recorded from several different surgeries. The Stanford researchers will collaborate with colleagues at the Utah-based Intermountain Healthcare, a 22-hospital system with large surgical volume, to analyze the videos and refine the evaluation tool. The future work will take into consideration the complexity of surgical cases, as some gallbladder removals, for instance, may be quite straightforward, while others might be more challenging because of a patient’s multiple medical problems, Jopling said.

He said the technology will be particularly helpful in surgical training, noting that it’s labor-intensive for a surgeon to sit for hours and review the videotaped performance of a trainee. The automated system could do this for them, and could alert surgeons, in real time during a procedure, if they are starting to lose their edge, said Milstein.

“It’s really important for the patient’s outcome to know when those moments of fatigue and deterioration set in,” he said. “Knowing when it’s time for a lead surgeon to take a break and allow the assistant to take over is analogous to a baseball coach deciding when a dip in accuracy and pitching speed indicates that a pitcher needs relief.”

Milstein has shared the work with Mary Hawn, MD, professor and chair of surgery at Stanford, who was also enthusiastic about presenting the model, once it’s perfected, to the American Board of Surgery as a possible addition to current board certification exams.

However, not all surgeons are enthused about the idea of having a machine second-guess their skills, Jopling said.

“I had one surgeon tell me, ‘When that day comes, that’s the day I’m going to retire,’” he said. But, Jopling added, “There are always things you can work on and improve. It’s like having a tennis coach that watches every single swing you take over the course of your career, but without blinking or getting tired — there is always something you can improve, but in surgery, you often don’t get that feedback.”

The AI technology could have broader applications in many aspects of medicine, Yeung noted. For instance, the group has been testing it to monitor the movements of patients in intensive care units — when, for example, they get in and out of bed or a chair — and to ensure that caregivers are following steps to keep patients safe. The technology is also being tested to monitor frail seniors at home, measuring their activities and mobility, and alerting others to a fall or other mishap that requires immediate attention.

“Clinicians, nurses and other health care providers are so overwhelmed now, and the problem is going to get worse as the baby boomer generation gets older,” Yeung said. “I think AI has great potential to provide an untiring, constant awareness of what is happening, which can be used to assist health care providers and prevent cognitive overload.”

But the work will go on at Stanford without Jin, who is now a freshman at Harvard, following the path of her brother, also a science whiz, who is a senior there. She said it was hard to say goodbye to her Stanford coaches after two years of intensive work, but she’s excited, and a bit nervous, about what may come next. Although she has not settled on a major, it’s no surprise that she’s considering computer science. And that, said Yeung, could be a boon to the profession.

“It’s great to have people like Amy who excel in computer science,” Yeung said. “It’s one of the problems AI is trying to address — that we don’t have enough women in the field and the number decreases at every stage. So we hope Amy will continue in the field and be a good role model for others.”

Ruthann Richter

Ruthann Richter is a freelance science writer.