On July 8, 2013, at 11:20 a.m., Olivia Munn hailed a taxi on Varick Street in Manhattan’s West Village. The actress took an 11-minute ride across the island, to the Bowery Hotel, for which she paid $6.50.
Later that day, at 7:34 p.m., Bradley Cooper caught a cab in Tribeca, outside the Greenwich Hotel, and arrived ten minutes later on Bank Street, just a few blocks north of where Munn had embarked eight hours prior, and paid $9.00 in fare. Both Munn and Cooper paid in cash; it’s unclear how much they tipped.*
I gathered this information not by following Munn or Cooper around, or hearing about it from someone else, but from an enormous database of every cab ride taken in New York City in 2013. Assembled annually by the city’s Taxi and Limousine Commission, the database catalogs the times, geographic coordinates, fares, and tips of approximately 173 million individual rides.
City officials initially released the database on March 18 to a New York City data analyst named Chris Whong, who requested the information under New York State’s Freedom of Information Law. The database is so large (nearly 50 gigabytes when uncompressed) that Whong had to buy a hard drive and drop it off at the taxi commission’s headquarters in Manhattan’s Financial District. He later uploaded the database for others to download, either directly or via BitTorrent, and created an absorbing interactive data visualization app called NYC Taxis: A Day in the Life that animated the 24-hour movements of each cab.
Though the data released to Whong was comprehensive and specific, city officials had attempted to anonymize two identifying details associated with every ride: the medallion number, an alphanumeric code assigned to each taxi cab in operation, and the hack license number, which is assigned to drivers authorized to operate a yellow taxi. But they did so, a software developer named Vijay Pandurangan discovered, by running both sets of numbers through a notoriously weak cryptographic hash function known as MD5.
To simplify things a bit: MD5 works by taking any input, like a string of text, and outputting a 32-character hexadecimal string (e.g., CFCD208495D565EF66E7DFF9F98764DA). The same input always produces the same output. “It’s pretty hard to figure out what the input was as long as you don’t know anything about what the input might look like,” Pandurangan explained in a Medium post on June 21. “...The problem, however, is that in this case we know a lot about what the inputs look like.”
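For readers who want to see this in action, here is a minimal Python sketch using the standard library's `hashlib` module. The helper name `md5_hex` is ours, chosen for illustration; nothing here is taken from Pandurangan's own code. As it happens, the example string quoted above is what MD5 produces for the single character `0`.

```python
import hashlib


def md5_hex(text: str) -> str:
    """Return the MD5 digest of `text` as 32 uppercase hexadecimal characters."""
    return hashlib.md5(text.encode("utf-8")).hexdigest().upper()


digest = md5_hex("0")
print(digest)       # CFCD208495D565EF66E7DFF9F98764DA
print(len(digest))  # 32

# Hashing is deterministic: the same input always yields the same digest.
print(md5_hex("0") == digest)  # True
```

That determinism is the weak point: anyone who can guess an input can hash the guess and compare it with the published value.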
New York City medallion numbers, Pandurangan realized, follow one of three very particular formats:
- One number, one letter, two numbers. For example: 5X55
- Two letters, three numbers. For example: XX555
- Three letters, three numbers. For example: XXX555
Hack license numbers are even more uniform: either 6 digits long, beginning with any number, or 7 digits long, always beginning with the number 5. With so few possible inputs, every candidate can simply be hashed and matched against the published values, which meant Pandurangan was able to completely reverse the city’s efforts to anonymize both sets of numbers. (Gawker reader Michael discusses this process in a bit more detail below.) A Google software engineer named Jason Hall later uploaded the de-anonymized version of Whong’s dataset to Google’s BigQuery service, allowing anybody on the Internet to quickly find rides associated with a particular medallion or license number.
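To make the weakness concrete, here is a rough Python sketch of the general approach: enumerate every string that fits the formats described above, hash each one, and keep a table mapping digest back to input. This is our illustration, not Pandurangan's actual code. The pattern notation (`d` for a digit, `L` for a letter, anything else taken literally) and all function names are invented for the example, and it assumes the letters are uppercase and that the raw string was hashed with no salt or padding.

```python
import hashlib
from itertools import product
from string import ascii_uppercase, digits

# Formats as described in the article. 'd' = digit, 'L' = letter (assumed
# uppercase), any other character is a literal. Notation is hypothetical.
MEDALLION_PATTERNS = ["dLdd", "LLddd", "LLLddd"]
HACK_PATTERNS = ["dddddd", "5dddddd"]


def md5_hex(text):
    """MD5 digest of `text` as uppercase hex (assumes unsalted input)."""
    return hashlib.md5(text.encode("ascii")).hexdigest().upper()


def pool(ch):
    """Characters allowed at one position of a pattern."""
    return digits if ch == "d" else ascii_uppercase if ch == "L" else ch


def expand(pattern):
    """Yield every string that matches `pattern`."""
    for combo in product(*(pool(ch) for ch in pattern)):
        yield "".join(combo)


def keyspace_size(patterns):
    """Count candidate strings without generating them."""
    total = 0
    for pattern in patterns:
        size = 1
        for ch in pattern:
            size *= len(pool(ch))
        total += size
    return total


def build_lookup(patterns):
    """Map each candidate's digest back to the candidate itself."""
    return {md5_hex(c): c for p in patterns for c in expand(p)}


print(keyspace_size(MEDALLION_PATTERNS))  # 18278000
print(keyspace_size(HACK_PATTERNS))       # 2000000

# Demonstrate on the smallest medallion format only (26,000 candidates).
lookup = build_lookup(["dLdd"])
print(lookup[md5_hex("5X55")])            # 5X55
```

Under these assumptions the three medallion formats allow only about 18 million candidates and the two license formats about 2 million, a search space small enough to enumerate in full on an ordinary computer.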
The implications of Pandurangan’s discovery were initially unclear. Nobody was quite able to demonstrate a specific, vivid case where the taxi data revealed what might be considered truly private information. After all, taxis operate in public, and the database didn’t supply the identities—anonymized or otherwise—of their passengers. Still, users on Reddit, Hacker News, and elsewhere continued to investigate the trove of data.
Three months later, on September 23, a Northwestern graduate student named Anthony Tockar documented at least two cases in which the database did in fact reveal, or at least confirm, passenger data. And not just any passengers, but two very famous ones: Bradley Cooper and Jessica Alba.
Tockar had realized that paparazzi photographers in New York City frequently capture celebrities entering or exiting yellow taxi cabs, and that many of their pictures show the cab’s unique medallion number. After all, the number is prominently displayed on the car’s exterior: in lit letters on top, in black paint on the side, and on both license plates. You can spot the cab’s medallion number in every photograph in this post.
Searching Google Images for “celebrities in taxis in Manhattan in 2013,” Tockar came across timestamped photos of Bradley Cooper and Jessica Alba entering or exiting a cab. Using the photos’ timestamps and accompanying descriptions to establish where they were taken, Tockar was then able to determine the pickup and drop-off locations, the amount of the fare, and the tip each celebrity paid their driver. He summarized his findings in a blog post for the data management firm Neustar, where Tockar was interning at the time:
In Brad Cooper’s case, we now know that his cab took him to Greenwich Village, possibly to have dinner at Melibea, and that he paid $10.50, with no recorded tip. Ironically, he got in the cab to escape the photographers! We also know that Jessica Alba got into her taxi outside her hotel, the Trump SoHo, and somewhat surprisingly also did not add a tip to her $9 fare.
He went on: “Now while this information is relatively benign, particularly a year down the line, I have revealed information that was not previously in the public domain.”
Going by Tockar’s technique, I searched the archives of several celebrity-photo agencies for photos taken in 2013 that contained the words “taxi” or “cab” in their description or metadata. (I found I didn’t need to specify the location; New York seems to be the only city where celebrities regularly take taxis instead of private cars.) I then combed the results for photos in which the taxi’s medallion number is clearly visible.
Using that subset of photos, and the metadata associated with each, I was able to locate the same kind of ride data that Tockar gathered using the photos of Cooper and Alba. This process wasn’t particularly efficient. Sometimes the timestamps were off by an hour; sometimes the agency’s description wouldn’t specify the neighborhood, only “Manhattan” or “New York.” In those cases I worked by process of elimination, figuring out which rides on a specific day, or at a specific hour, the celebrity did not take.
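The matching step can be sketched in a few lines of Python. This is a simplified illustration, not the exact procedure used for this story: the `Ride` record, its field names, and the sample rides are all hypothetical placeholders rather than rows from the TLC database. The one-hour tolerance reflects the timestamp drift described above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Ride:
    """Hypothetical, pared-down ride record; field names are assumptions."""
    medallion: str
    pickup: datetime
    dropoff: datetime


def candidate_rides(rides, medallion, photo_time, tolerance=timedelta(hours=1)):
    """Return rides by `medallion` whose pickup-to-dropoff span, widened by
    `tolerance` on both ends, contains the photo's timestamp."""
    return [
        r for r in rides
        if r.medallion == medallion
        and r.pickup - tolerance <= photo_time <= r.dropoff + tolerance
    ]


# Placeholder data for illustration only.
rides = [
    Ride("5X55", datetime(2013, 1, 1, 9, 0), datetime(2013, 1, 1, 9, 12)),
    Ride("5X55", datetime(2013, 1, 1, 12, 30), datetime(2013, 1, 1, 12, 41)),
    Ride("XX555", datetime(2013, 1, 1, 12, 31), datetime(2013, 1, 1, 12, 50)),
]

matches = candidate_rides(rides, "5X55", datetime(2013, 1, 1, 12, 35))
print(len(matches))       # 1
print(matches[0].pickup)  # 2013-01-01 12:30:00
```

When the filter returns more than one ride, the photo's description (a neighborhood, a hotel name) has to rule out the rest.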
Once I singled out the specific ride, it was a matter of translating the geographic coordinates of the pickup and drop-off, expressed as longitude and latitude, into street addresses, and plotting the most probable driving route between them using Google Maps.
Nothing especially surprising is revealed by this portion of the data; if anything, it confirms that celebrities like hanging out in Tribeca and the West Village, and often stay at the Bowery Hotel in the East Village and the Trump Hotel in SoHo. What it does indicate, however, is the scope of information the Taxi and Limousine Commission is collecting about daily taxi rides, and how that data might be used in the future.
What is surprising is the number of celebrities mentioned here whose cabbies did not record a tip on top of the base fare: Jessica Alba, Jessica Biel, Amanda Bynes, Bradley Cooper, and Olivia Munn. Was the database wrong? Perhaps they forgot they weren’t using Uber?
Representatives for Biel and Bynes didn’t return requests for comment. Munn’s publicist was mostly confused by our inquiry. Alba had a fairly believable explanation. “This story is not accurate,” her publicist told Gawker. “Jessica always makes a point of giving a cash tip, even if she pays with a credit card.” After initially declining to comment and asking us to kill this story, Cooper’s publicist emailed us a statement: “Bradley takes the subway when he’s in New York and when he takes a taxi he leaves very good tips. No truth to this.”
As noted in the update below, it turns out that the database released by the taxi commission attributes a $0 tip to the vast majority of passengers who paid for rides in cash, as Alba, Biel, Bynes, Cooper, and Munn did. That doesn’t mean Alba et al. did tip, but it does mean that we can’t say for sure that they didn’t.
* Update: Chris Whong points out that the TLC database appears to misreport the amount of cash tips for a number of rides—it’s unclear how many, though. We raised the possibility of incorrect data with each celebrity’s publicist when we contacted them, but we were not aware that the database records $0 tips for a statistically unusual number of passengers who pay in cash, as Alba, Biel, Bynes, Cooper, and Munn did.