data is shitty
Thinking
I'm reading through Benjamin Breen's latest newsletter which is, as often is the case, about AI. Previously, he's written about how he's used AI tools to help in his research process, often as an interface for dealing with difficult historical sources (parsing old handwriting, as a translation aide, as a first-pass search or synthesis tool) and as a tool for teaching with, by building up necessarily incomplete and incorrect but nonetheless evocative interactive scenarios for students to play with. This post is likely to be less controversial than some of those (one of which went by the title "The leading AI models are now good historians" with the caveat "...in specific domains" hidden in the subhed).
Anyway, Breen's argument here was that, at least in the medium term, AI is not going to replace historians because doing good history relies on physical presence (in archives), intuition that resists explanation, and illegible data. That last one really stuck with me, because my pre-grad school career was based almost entirely around making data legible. I spent four summers and then five years up to my elbows in data, making it play nicely with software. On my resume, that looks like working in tech. But getting dirty in data often felt like the plumbing of the programming world: someone has to actually make the shit go through the pipes, and the fancy architects aren't the ones doing it.
I assume LLMs can handle shittier data than many other systems, particularly when things are missing or have outlying aberrations, since LLMs and other machine learning systems are all about extrapolating based on prior patterns. But in my experience, dealing with real life data requires a lot of intuition and tolerance of mess. It's knowing that when a bunch of service points suddenly show up in the middle of the Atlantic, something has stripped out the lat-long data, not that someone is building a sovereign island. It's knowing that when you see a bunch of entries where the birth year is 1875, there's probably a default date being filled in when data is missing, not people pretending to be 150 years old (COUGH ELON COUGH). It means noticing when files feel weirdly small, or the load time is weirdly short. It even means figuring out what on earth has happened when your data looks alarmingly well-behaved. People aren't well-behaved! Data isn't well-behaved! It's ugly as shit, and always in new and fun ways.
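Those two tells translate directly into the kind of sanity checks you end up writing. Here's a minimal sketch; the field names, the (0, 0) "Null Island" check, and the 1875 sentinel are illustrative assumptions, not anyone's real schema:

```python
def flag_anomalies(records):
    """Return (record_index, reason) pairs for suspicious rows."""
    flags = []
    for i, rec in enumerate(records):
        # A cluster of points at (0, 0) -- "Null Island," off the coast
        # of West Africa -- usually means coordinates were stripped or
        # defaulted upstream, not real locations.
        if rec.get("lat") == 0.0 and rec.get("lon") == 0.0:
            flags.append((i, "null-island coordinates"))
        # A spike of one exact birth year is usually a fill-in default
        # for missing data, not a colony of 150-year-olds.
        if rec.get("birth_year") == 1875:
            flags.append((i, "sentinel birth year"))
    return flags

records = [
    {"lat": 40.7, "lon": -74.0, "birth_year": 1980},
    {"lat": 0.0, "lon": 0.0, "birth_year": 1990},
    {"lat": 51.5, "lon": -0.1, "birth_year": 1875},
]
print(flag_anomalies(records))
# → [(1, 'null-island coordinates'), (2, 'sentinel birth year')]
```

The point isn't the code, which is trivial; it's knowing which checks to write in the first place, and that only comes from having been burned before.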
I was involved in recruiting a bit for one of the data-centric jobs I worked in before grad school, and one of the things we talked about a lot was all of the other skills and experience we cared about more than a computer science degree. We had historians and physicists and a good number of mathematicians. Most of us had taught ourselves to code. The CS grads were good at explaining an algorithm on a whiteboard, but often froze when data didn't behave as expected.1 We weren't even working with messy qualitative data: we worked with utility companies to ingest their usage data. Very quantitative. IDs, locations, dwelling information, amounts of energy, timestamps. AND YET, every new client brought newly bizarre horrors in that data. The first step to fixing those problems was to essentially become a detective and follow your intuition to find clues about how the clients' systems were mangling the supposedly-straightforward numbers. It involved looking at lots of numbers in CSV files and querying databases and fighting with text encoding to tell a story about what was happening.
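One recurring detective step with usage data like that is checking whether the timestamps actually form the interval series they're supposed to. A sketch, assuming hourly readings (the field names and the one-hour interval are hypothetical):

```python
from datetime import datetime, timedelta

def find_timestamp_gaps(timestamps, expected=timedelta(hours=1)):
    """Return (prev, curr) pairs where consecutive readings aren't the
    expected interval apart: gaps hint at dropped readings, zero-length
    deltas at duplicated rows."""
    stamps = sorted(datetime.fromisoformat(t) for t in timestamps)
    return [(a.isoformat(), b.isoformat())
            for a, b in zip(stamps, stamps[1:])
            if b - a != expected]

readings = ["2024-01-01T00:00", "2024-01-01T01:00",
            "2024-01-01T03:00",   # 02:00 reading is missing
            "2024-01-01T03:00"]   # 03:00 reading is duplicated
print(find_timestamp_gaps(readings))
# → [('2024-01-01T01:00:00', '2024-01-01T03:00:00'),
#    ('2024-01-01T03:00:00', '2024-01-01T03:00:00')]
```

Every flagged pair is a question to chase down, not an answer: did the meter go offline, did an export job run twice, or did someone's timezone conversion eat an hour?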
Smoothing away the corner cases allows LLMs to generate freakishly perfect slop. It's what allows them to write perfectly bland essays. But, like Breen pointed out, the tech is fundamentally ill-suited for noticing the one weird piece of correspondence, or the unusual pattern of missing documents, or the heart-wrenching story that your eye just happened to catch on as you flipped through an old diary. He said:
"The issue is that generative AI systems don’t want messy perspective jumps. They want the median, the average, the most widely-approved of viewpoint on an issue, a kind of soft-focus perspective that is the exact opposite of how a good historian should be thinking."
Being a historian is truly not that different from working with data in this way. (Or from doing advanced mathematics, for that matter). All of them require this sense of intuition that gets built up and fine-tuned over time, that helps the brain wade through piles and piles of stuff in order to build an argument or tell a compelling story (usually both).
I also finally read through one of Dave Karpf's newsletters from last year, also about AI and data.2 He observed, using some really helpful explanation from computer scientists Arvind Narayanan and Sayash Kapoor's recent book AI Snake Oil, that "predictive AI is just a rebrand of the Big Data hype bubble from 10-15 years ago," (god, yes, thank you) and crucially, that during and after that bubble we learned a lot of things about how data is shit. Things that the AI hype bros completely ignore while covering their ears and shouting about the singularity.
Karpf pointed to his 2016 book that argued that political strategists can and should pay attention to data—things like email analytics and polling—and should also strongly resist the siren song to cede control of their strategy to blindly trusting whatever the data tells them instead of crafting a cogent and principled vision. He worried that the meteoric rise of AI in everything would make this risk even worse, resulting in more and more organizations abdicating their strategic power to bland, unaccountable LLMs.
The Democrats' 2024 campaign and the wanton maniacal destruction that is DOGE both seem to be clear examples of this type of abdication: no coherent plan, just tacking to the winds of polling and glorified Grok-powered find-and-replace. It's truly nauseating. And the threat of applying the same lazy playbook to deportations is horrific.
As Breen put it:
"Generative AI has made it absurdly easy to generate a lot of text or images. But it hasn’t made us any better at subtracting the useful, meaningful, or simply interesting stuff from text and images — isolating the gold from the fool’s gold."
Chasing fool's gold seems like a spot-on metaphor for the entire political situation we've gotten ourselves into. It's greed and malicious grift all the way down.
Reading
After Trump was elected the first time, I read a bunch of depressing books about the Holocaust. This time, I went closer to home, and just finished the audiobook for America's Deadliest Election, which purports to be about electoral politics but is actually about the end of Reconstruction starting in Louisiana and the descent of the US South (back) into fascism. It's an okay book—it's written by a news anchor and really, truly, sounds like it—but it tells a horrifying and fascinating story worth reading about. I have more thoughts I might write up another time, mostly circling around the sources of political legitimacy.
Doing
Purim starts tonight, so this afternoon I'll be making hamantaschen! I'm thinking poppyseed and guava/cheese again this year. Maybe a couple jams.