Heading to Fenway Park for a late-season Red Sox-Yankees confrontation is a tough gig, but someone has to do it. So I did.
The business reason was to get a look at how MLB Advanced Media (MLBAM) gathers all the data it uses for its Statcast service. Statcast can track the actual distance of home runs from contact to where the ball would have landed if it could hit the ground without hitting a wall or the stands first. It also captures such previously unknown tidbits as first step (how long it takes a fielder to react to a batted ball), the fielder’s arm strength (basically how fast and hard the throw is), and exchange time (how long it takes him to transfer the ball from his glove hand to his throwing hand).
In the past, home run distance was largely guesswork (and the subject of heated debates). Statcast can offer some definitive answers based on actual physics. This year’s longest home run, for example, was the 493-footer crushed by the Nationals’ Michael Taylor off Rockies pitcher Yohan Flande in August. Purists will likely point out that it was probably aided by the thin air at Denver’s Coors Field, so I’m guessing that the era of asterisks is not over, even with the more accurate data. Another fun fact: Marlins phenom Giancarlo Stanton is responsible for the hardest-hit ball this season, an RBI double against the Phillies that left his bat at 120 mph.
MLBAM applies technology including Trackman radar to measure pitch speeds and the exit velocity of batted balls, ChyronHego to track player position and movement on the field, and Amazon Web Services S3 and other services to store and handle that information. The radar, sitting above and behind home plate, is mostly useful for measuring the speed of balls flying toward or away from it.
“Throws from third to first may not be picked up [by radar], so we can use optical tracking to pick those up,” said Greg Cain, senior director of sports data for MLBAM, who was my tour guide.
Each game generates roughly 7 terabytes of uncompressed data, or 80 gigabytes compressed, Cain said. Speaking at the AWS re:Invent confab last November, MLBAM CTO Joe Inzerillo said that, at 2,430 games per year, that comes to 17 petabytes of raw data annually. Inzerillo’s demo used the seemingly impossible double play initiated by San Francisco Giants’ Joe Panik as proof positive that players should never, ever slide into first.
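For the curious, the per-game and per-season figures line up. Here is a quick back-of-the-envelope check in Python, a minimal sketch that assumes decimal units (1,000 gigabytes to the terabyte, 1,000 terabytes to the petabyte). The function names are made up for illustration and are not anything MLBAM uses.

```python
# Back-of-the-envelope check of the data volumes quoted in the article.
# Assumption: decimal units (1 PB = 1,000 TB, 1 TB = 1,000 GB).

GAMES_PER_SEASON = 2_430        # figure cited by Inzerillo
RAW_TB_PER_GAME = 7             # "roughly 7 terabytes" uncompressed, per Cain
COMPRESSED_GB_PER_GAME = 80     # "80 gigabytes compressed", per Cain

TB_PER_PB = 1_000
GB_PER_TB = 1_000


def season_raw_petabytes(games=GAMES_PER_SEASON, tb_per_game=RAW_TB_PER_GAME):
    """Raw (uncompressed) data for a full season, in petabytes."""
    return games * tb_per_game / TB_PER_PB


def season_compressed_terabytes(games=GAMES_PER_SEASON,
                                gb_per_game=COMPRESSED_GB_PER_GAME):
    """Compressed data for a full season, in terabytes."""
    return games * gb_per_game / GB_PER_TB


def compression_ratio(tb_per_game=RAW_TB_PER_GAME,
                      gb_per_game=COMPRESSED_GB_PER_GAME):
    """How many times smaller one game's data is after compression."""
    return tb_per_game * GB_PER_TB / gb_per_game


if __name__ == "__main__":
    print(f"Raw per season:        {season_raw_petabytes():.2f} PB")
    print(f"Compressed per season: {season_compressed_terabytes():.1f} TB")
    print(f"Compression ratio:     {compression_ratio():.1f}x")
```

Under those assumptions, the raw total works out to about 17 petabytes, matching Inzerillo’s number, while the compressed season comes to a little under 200 terabytes.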
MLBAM makes extensive use of the AWS S3 storage service, as well as Redshift data warehousing and Kinesis real-time data analytics, and is interested in seeing what Amazon’s Machine Learning (ML) service can bring to the table in terms of building and refining models.
The thing with any big data crunching project is this: getting and handling the data is one issue, knowing what questions to ask about it is another, and deciding how much of that data to expose to viewers is yet another. That intersection of computer smarts and human expertise is critical.
In the MLB Network TV van during the second inning of the game, MLB segment producer Mike Treanor and staff watched a battery of screens to see which of the myriad stats generated should show up on the broadcast.
I watched them watch the screens as Yankees (and former Red Sox) second baseman Stephen Drew hit a double in the fifth to put the Yanks ahead. The system spit out 16 stats pertinent to that play, but Treanor’s call was: “I just need hit velocity. It’s the biggest hit of the game.”
And then they moved along.
As many insights as Statcast has provided, Cain, like most in the data science realm, remains intrigued by what hasn’t been explored yet. What models have yet to be built? What is the likelihood that any pitched ball will be hit, and how far, and to which field?
Said Cain, “It all comes down to: What are the questions we haven’t thought of yet and how do we structure them?”
If you haven’t seen Statcast in action on the MLB Network, Fox, or TBS, check out the video below of Red Sox infielder Brock Holt’s double play and associated metrics.