Last week at the Time Center here in New York City, L2 hosted a clinic on how digital is impacting consumer behavior. We were especially impressed with the presentation and discussion with Seth Stephens-Davidowitz, author of “Everybody Lies”. Seth’s research explores the hidden truths that our Google searches and Facebook posts reveal about us. Enjoy. We’ll see you in two weeks.

I study what we can learn about the human psyche from internet behavior – from people’s internet behavior – but I’m going to start with a story that doesn’t have to do with the internet or people. It’s the story of a horse. This is American Pharoah. In 2015 he became the first horse in 37 years to win the Triple Crown, but when he was a one-year-old up for auction in upstate New York, he didn’t look like a special horse. Nobody really knew there was anything spectacular about American Pharoah – nobody except this guy, Jeff Seder. Jeff Seder has three degrees from Harvard, and when he was 26 years old he was working at Citibank. He looked at himself in a three-piece suit in the mirror, said “this isn’t me,” quit his job at Citibank, moved to rural Pennsylvania, and decided he was going to work with horses – in particular, he was going to study what makes a great racehorse great. And he knew how he wanted to do it: with data science, building big data sets that could correlate horse performance with various attributes and figure out what really matters and what doesn’t. For years his project was a bit of a failure. His first approach to predicting what makes a great racehorse great was to measure the size of horses’ nostrils. He figured big nostrils would let them breathe better.
He created the world’s largest – still to this day, the world’s largest – horse nostril database ever created, correlated it with how the horses eventually did, and found out that it did not work. It did not predict racehorse success. He measured the size of their muscles, put rulers around their legs – this maybe made a little more sense – correlated that with eventual outcomes, and found again that it did not predict racehorse success. He was an eccentric, kind of a weird guy; I met him in Florida. He once measured the size of horses’ defecations: he sat outside before they raced and measured how big their poop was. That wasn’t a stroke of genius; it didn’t work either. Then one day he decided he was going to build an ultrasound – the world’s first-ever horse ultrasound – to measure the internal organs of horses, including various attributes of their hearts. And he found that the left ventricle – when he correlated left-ventricle size with eventual horse success, how many races they won – was a massive predictor of how well the horses turned out. So when American Pharoah was one year old, he was up for auction, and Jeff Seder was consulting with a client. He showed me the card he had on American Pharoah at one year old. He noted, like everybody else, that the horse had about average height – 56th percentile – about average weight, and a little better than average pedigree. Nothing too special about American Pharoah, except of course for his left ventricle, which was in the 99.6th percentile. Actually, among horses with left ventricles that big, most have diseases: their other organs are all tiny and they have an enlarged left ventricle. But with American Pharoah, all his organs were pretty big and his left ventricle was enormous. So he told his client, basically, “This is a once-in-a-generation horse.” Obviously a successful prediction.
There are a few lessons for data scientists that I take from Jeff Seder’s story. The first is that frequently the value of a data set is not how big it is; it’s its newness – that you have something new that nobody else has. Jeff Seder’s data set covered tens of thousands of horses. In the world of big data, where companies have billions or tens of billions of data points, a data set of tens of thousands of horses that can fit in an Excel spreadsheet maybe won’t seem so impressive, but it actually got the job done. Jeff Seder’s genius was to go out and get new data that other people didn’t have, including building the first horse ultrasound to do it. Second, you have to fail a lot along the way: you don’t wake up with the left ventricle, you try a lot of things and fail over and over again, maybe for decades, before you eventually find the big winner. And I think more importantly – and really exciting for people in the field of data science – there are left ventricles out there. There are things that, if you know them about a customer, or whatever you’re trying to measure, your model isn’t 10% better or 20% better – you might have a model ten times better, just because you know one thing about people or horses that other people don’t know. So the obvious question is: what are the left ventricles waiting to be found? And I think, whatever industry you’re in, there are left ventricles waiting to be found if you follow this entrepreneurial process. But of course, you’re not in the business of studying horses, you’re in the business of studying people. And when you study people, you have to keep in mind that everybody lies – which is the title of my book and the focus of my research over the last five years. So my favorite example – a little R-rated, just to spice up the conference a little bit – is sexuality.
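Seder’s screening process – measure lots of candidate attributes, correlate each with eventual race results, and keep the one with outsized predictive power – can be sketched on toy data. Everything below is synthetic: only the “left ventricle” feature is wired to the outcome, and the feature names are just labels borrowed from the story.

```python
import random
import statistics

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

random.seed(0)
n = 1000
# Synthetic horses: two measurements are pure noise; one drives the outcome.
nostril_size   = [random.gauss(0, 1) for _ in range(n)]
leg_muscle     = [random.gauss(0, 1) for _ in range(n)]
left_ventricle = [random.gauss(0, 1) for _ in range(n)]
wins = [2.0 * lv + random.gauss(0, 1) for lv in left_ventricle]

features = {"nostril_size": nostril_size,
            "leg_muscle": leg_muscle,
            "left_ventricle": left_ventricle}
# Rank features by the strength of their correlation with race wins.
for name in sorted(features, key=lambda f: -abs(pearson(features[f], wins))):
    print(f"{name:15s} r = {pearson(features[name], wins):+.2f}")
```

The noise features land near r = 0 while the wired-in one stands out, which is the whole point of screening many attributes before trusting any of them.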
This is the General Social Survey – the biggest sociological survey in the United States, put out every two years by the University of Chicago. They ask Americans how frequently they have sex, whether they use a condom, whether it’s heterosexual or homosexual sex. So you can do the math from this: women say they average sex about once a week and use a condom 20% of the time. In heterosexual sex, this adds up to 1.1 billion condoms a year reported used by women in heterosexual encounters. Ask men the exact same questions the General Social Survey asks, do the exact same math, and men report 1.6 billion condoms used in heterosexual sex every year – this is Americans, annually. And of course, by definition those have to be the same, so we already know somebody’s lying. So who’s telling the truth, men or women? Neither. Nielsen calculates all the condoms sold every year: only 600 million condoms are sold every year, some used by gay men, some thrown out. So basically everybody is lying about sex – men just more than women. We have a new approach to understanding the human psyche, and that’s people’s Google searches. When people are online, when they’re alone, they’re more honest, and more importantly, with Google you have an incentive to be honest, to get the information you need. And we can analyze this data.
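The back-of-envelope behind those numbers is just multiplication. A rough sketch – the ~105 million figure for women reporting heterosexual activity is an illustrative assumption, chosen only so the arithmetic lands near the 1.1 billion quoted above:

```python
# Survey answers, as quoted above: sex about once a week, condom 20% of the
# time. The population figure is an illustrative assumption, not from the talk.
women         = 105_000_000
acts_per_year = 52       # about once a week
condom_rate   = 0.20     # condom reported used 20% of the time

reported_by_women = women * acts_per_year * condom_rate
condoms_sold      = 600_000_000   # Nielsen figure quoted above

print(f"women's reports imply ~{reported_by_women / 1e9:.1f} billion condoms/year")
print(f"condoms actually sold:  {condoms_sold / 1e9:.1f} billion")
```

Whatever exact population you plug in, both the men’s and the women’s reported totals sit far above what is actually sold, which is the contradiction the talk is pointing at.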
I call Google searches “digital truth serum,” because people tell Google things they might not tell friends, family members, surveys, doctors, market researchers, or many other sources. I think the main approach to understanding people – surveys – is going to lose ground, or at the least be supplemented by Google data. Google Trends is a public data source of the anonymous, aggregate searches that everyone makes; you can compare them across time and across place. When you analyze Google Trends data, as I’ve been doing for the last five years, you get a different picture of people than you do from other sources. For example, there are more searches for porn than for weather. If you ask people in a survey, about 20% of men and 4% of women admit to watching porn, which is very hard to reconcile with this data; I think a hundred percent admit to checking the weather. So it’s definitely a different view of people, and I think in many ways a more honest view of people on Google than you’re going to get from any other source. And Google Trends has proven itself better than surveys for measuring many, many outcomes where we have some ground truth to compare it against. Predicting who will turn out and vote: you can’t really predict turnout in an election by asking people “Are you going to vote?” – more than 50 percent of people who don’t vote tell surveys before an election that they’re going to vote. But you can predict with great accuracy where turnout is going to be high, based on people searching “how to vote,” “where to vote,” and polling places in the weeks before an election. Predicting who is at risk of suicide: I’m working on a New York Times column right now about the predictive power of Google for suicide.
People search really, really disturbing stuff on Google – “how to kill yourself,” “kill yourself,” “suicide,” “commit suicide” – at rates much higher than surveys asking people whether they have suicidal ideation would suggest. It’s disturbing, but hopefully it could be used by mental health professionals for some sort of intervention. Measuring racism – this is how I started the research. I was doing my PhD in economics when I found Google Trends and became obsessed with it, and one of the first things I did was to measure racism in the United States. I was shocked by what the data showed me. If you ask people in a survey, “Are you racist?”, ninety-nine, ninety-nine point five percent of Americans say, “No, no, of course not.” And I found that on Google, people make racist searches with disturbing frequency. This is the percent of Google searches that include the n-word – not the phrase “n-word,” but the actual word – and darker red means a higher percent of searches. Again, it’s the percent of searches, so it’s not that some places are bigger, have higher populations, or use Google more. In the time period I was looking at, people were searching this with about the same frequency they were searching “migraine,” “Lakers,” “Daily Show,” and “economist” – so it wasn’t a fringe search, it was millions of searches every year. The real divide in racism these days, if you look at this map closely, is not necessarily North versus South; it’s much, much more East versus West, where the frequency gets a lot lower as you get to the western part of the country.
Anyway, I published this, and everyone was like, “This is so stupid, why are you analyzing Google searches?” It was considered kind of weird back in the day. When I was doing this, there was this idea that we lived in a post-racial society, that this extreme racism was a thing of the past. Then of course we had a presidential candidate who said some very racially charged things, and everyone thought he was going to fall apart, and he didn’t. So Nate Cohn – he’s a data journalist at the New York Times – asked me during Trump’s rise to send over the racism data I had collected. He had a data set of support for Trump in different parts of the country, and he wanted to see if they were correlated. And he found it was the single highest correlation of anything he could find. Nate Silver reported the same thing: that racist Google searches were the single highest correlate of support for Trump in the Republican primary – higher than economic variables or ideology or anything else you could measure. I make a big distinction between Google search data, which is digital truth serum where people are so honest, and another huge data set – Facebook data – which I think is much less honest, not just than Google but even than anonymous surveys. I call Facebook “digital brag-to-my-friends-about-how-good-my-life-is serum,” because people aren’t really confessing their secrets to Facebook.
They’re trying to show off to their friends, and we see over and over again that Facebook data doesn’t really correspond with reality. You can compare, for example, two magazines: the National Enquirer – a trashy, gossipy magazine – and The Atlantic – an intellectual, philosophical, poetic magazine. We actually know how popular these two magazines are; we have circulation data on them. According to the Alliance for Audited Media, the National Enquirer actually sells more copies than The Atlantic every year, by about a factor of three – it’s the more popular magazine. But if you look at Facebook Likes data, where people are bragging to their friends about how good their lives are, you get a very different picture of the popularity of these two magazines: The Atlantic is forty-five times more popular than the National Enquirer, because of course people want their friends to think they’re intellectual. I like to compare the two big data sources – the Facebook “digital brag to my friends about how good my life is” data and the Google “digital truth serum” data – on how people describe their husbands on social media versus on search. Starting with social media, the top ways people complete the phrase “my husband is” are: number one, “the best”; then “my best friend,” “amazing,” “the greatest,” and “so cute.” So a very, very nice view of marriage, according to social media. What happens on Google, when people aren’t broadcasting their marriage to the world but actually trying to get information? It turns out the third descriptor of husbands on Google – which I found surprising but cute – is actually “amazing,” so that one checks out as a reasonable description of husbands. But the others rounding out the top five on Google are “gay,” “a jerk,” “annoying,” and “mean.”
I think a lot of the data analysis I’ve shown you so far just presents how the world is – where racism is highest, which magazines are more popular, how people describe their husbands – but I think the ultimate power of this data is to change the world and improve it. This is maybe my favorite study that I’ve done, because I think it has so much potential to improve the world. It’s a study of Islamophobia. On Google, people make some really, really nasty searches about Muslim Americans – not a lot of people, but not a trivial number. You know, thousands make really, really nasty searches about Muslims. They search things like “kill Muslims” or “I hate Muslims” or “Muslims must die” or “Muslims are terrorists” – really horrible searches by people in a really angry frame of mind. Sometimes the searches are late at night, and these are, you know, not necessarily the sanest members of society. These seem like such weird searches – do they carry any information, when people are searching something like “kill Muslims”?
You know – they’re so enraged – well, they actually do, because we see that when these searches are higher, when more people are searching “kill Muslims” and “I hate Muslims,” there are more hate crimes against Muslims. Muslims are much more at risk: these searches translate, days or weeks later, into attacks on Muslim Americans. Islamophobic searches have been problematic over the last few years, but there was one particular period where they really got out of control. It was in December 2015, after the San Bernardino attack, if you remember that, where two Muslim Americans shot up a group of co-workers. Immediately after this attack there was a huge rise in these nasty searches; the number one search about Muslims on Google right after the attack was “kill Muslims.” So there were a lot of people in America who were enraged and doing some really nasty stuff, and these searches predicted that hate crimes were going to be higher. A few days after the San Bernardino attack, I think Barack Obama knew that Islamophobia was becoming problematic, so he gave a speech not just about stopping terrorism but also about stopping Islamophobia – filled with great, moving, poetic lines about the responsibility of all Americans to reject discrimination.
How it’s our responsibility not to give in to fear, to appeal to freedom, how it’s our responsibility to let everybody into this country no matter their religious background – all the things that Americans should do, and who we are. I was doing research on Islamophobia during this speech, and Google searches can be broken down minute by minute, so I wanted to see what happened to all these nasty searches about Muslim Americans during and after the speech. I looked at the data, and I found that not only did the Islamophobic searches not drop, they didn’t even stay the same – they shot way up. There were more people searching “kill Muslims,” “I hate Muslims,” and “no Syrian refugees,” and fewer positive searches about Muslim Americans, and this stayed the same long after the speech. So that’s a pretty pessimistic conclusion: even a beautiful speech by Obama, one that we think is helping people and that sounds like it’s doing all the right things, could actually be backfiring. But there was one more, maybe more optimistic, conclusion from this speech. At the end of the speech, Barack Obama said that Muslim Americans are our friends and neighbors, our sports heroes, and the men and women who will die for this country. And you see, right after this line, a huge rise in searches for Muslim athletes and Muslim soldiers. In fact, the top descriptor of Muslims on Google, for the first time in many years, was not “Muslim terrorists” or “Muslim extremists” – it was “Muslim athletes,” followed by “Muslim soldiers,” and those kept the top spots for about a week afterwards. And you saw around the internet people, young men, saying, “Oh, Shaquille O’Neal’s Muslim. Muhammad Ali’s Muslim.” They didn’t know these things. So I thought about the difference between the two approaches: what didn’t work and what did work?
I think what didn’t work is lecturing people – talking about responsibilities, things they should do, giving them information they’ve been told a thousand times before. What maybe is more effective is subtly provoking their curiosity, changing the way they think of the group that’s causing them so much rage. So we published this in a New York Times column, and I don’t think it’s crazy that when you write something in the New York Times, powerful people read it – maybe even someone on the president’s staff – because a few days later Obama gave another speech, this time in a Baltimore mosque. Again it was on national TV, and again it got a lot of attention, but you saw that Barack Obama’s strategy in this speech was very different. There was no talk of responsibility, no talk of what people should do, none of the lectures and sermons that people have been told a thousand times before. He really doubled down – or tripled down, or quadrupled down – on the curiosity strategy. He said that Muslim Americans are farmers and merchants, that Muslim Americans helped build the skyscrapers of Chicago, that Thomas Jefferson kept a copy of the Quran in his office – all kinds of new information, changing how people think of this group causing them so much rage. And I checked what happened to all these nasty searches about Muslim Americans after this nationally televised speech, and you saw that just about all of them dropped.
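Minute-by-minute search series make this kind of before/after reading a simple event study: compare the average search rate in a window before the speech with the window after. A minimal sketch on entirely synthetic minute-level data (the series and the effect size are made up; real Google Trends data would replace them):

```python
import random
import statistics

random.seed(42)

# Synthetic minute-level search-rate series: 120 minutes before a speech and
# 120 minutes after, with the post-speech level deliberately built to be lower.
before = [random.gauss(10.0, 1.5) for _ in range(120)]
after  = [random.gauss(7.0, 1.5) for _ in range(120)]

def effect(before, after):
    """Difference in mean search rate, after minus before (negative = drop)."""
    return statistics.mean(after) - statistics.mean(before)

print(f"change in mean rate: {effect(before, after):+.1f} searches/min")
```

The same comparison run on the first speech would show the sign flipped, which is exactly the backfiring-versus-working contrast described above.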
This really does show the power of these new big data sources: when you have minute-by-minute searches, you can turn something as seemingly chaotic as how to calm an angry mob into a real science. I think you’re going to see that in more and more areas where we really haven’t known the answers, where we’ve just been appealing to our intuition and patting ourselves on the back. With this new data – an unprecedented window into the human mind from Google searches, available for tiny locations, minute by minute – you’re going to see more and more areas turned into real sciences. So that’s my book, “Everybody Lies.” You can learn a lot more about this and all my research there, and I hope you utilize some of these tools in your work in the future. And now Scott is going to ask some questions with me.

So, that’s fascinating. A couple of questions: when you track racism, do you track it longitudinally – are we becoming less or more racist since the election?

Yeah, so I think you actually haven’t seen a big rise in racism. I think what’s happened is that a lot of the racism that was underground and always there is becoming more public, but you haven’t seen a big rise. One thing you see in this data set is that the number of people who make nasty searches is way bigger than the number of people who commit nasty acts. So, even the “kill Muslims” thing: there are about 12,000 of those searches every year, and there are about 11 murders of Muslim Americans.
So these searches do have predictive power at the city level, where you can say, okay, these are higher than the national level, so hate crimes are going to be high – but the average person making these nasty searches is still very unlikely to commit a crime. I think one of the things we’re going to learn is that probably a lot more teenagers search for how to make a dirty bomb, or something like that, than we realized, and that it may be less of a predictive tool than we thought. We want to be a little careful – not just for ethical or privacy reasons, but for data science reasons – about intervening when people are having bad thoughts. Or the suicide stuff: I think there are 3.5 million searches for suicide every month in the United States, and there are four thousand suicides. So these searches have huge predictive power at the area level, but it’s still true that the average person who makes this search is very, very unlikely to actually commit suicide. You don’t want to be knocking on people’s doors and putting them in mental hospitals because they searched “how to kill yourself” on Google.

Your greed glands get going around this, because you start to think about markets. Have you tried to correlate this with, say, when someone starts doing searches around buying cryptocurrency – to see if there’s a lag or an impact between search volume around cryptocurrencies and price movements?

I’ve done a little bit of that. You can usually predict the volume of trading: when people are searching for gold a lot, you know the trading in gold is going to be high. But you don’t necessarily know buy or sell, so it’s harder to trade on – it can still be useful, you can make some money on that. And people tend not to search “buy gold”; they tend to search “gold prices” or something else. “Buy gold” is such a rare search that it doesn’t have as much predictive power.
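The caution about intervening on individuals comes down to base rates: even if every one of those acts were preceded by a search, the implied risk per individual search is tiny. A quick check with the figures quoted in the conversation above (treating both suicide numbers as monthly is an assumption; the point survives either reading):

```python
# Figures quoted in the talk.
kill_muslims_searches = 12_000   # per year
muslim_murders        = 11       # per year

suicide_searches = 3_500_000     # per month (assumption: both figures monthly)
suicides         = 4_000

# Upper bound on per-search risk, assuming every act follows a search.
print(f"implied risk per 'kill Muslims' search: "
      f"{muslim_murders / kill_muslims_searches:.4f}")   # 0.0009
print(f"implied risk per suicide search:        "
      f"{suicides / suicide_searches:.4f}")              # 0.0011
```

Roughly one in a thousand in both cases, and that is a generous upper bound – which is why the signal is useful at the area level but not as an individual-level alarm.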
Could you use this type of data to highlight the best places to build new stores?

Yeah, definitely. Another thing you can do, which I talk about in my book, is this concept of a doppelgänger search, which is really powerful in data science. The way I first learned about it was Nate Silver, who before he predicted politics predicted athletes, and his predictive model for athletes was way better than everybody else’s. He would analyze baseball players, find the most similar players to them on their current trajectory, and then ask how those players did in the future. If you have, say, David Ortiz at 31 years old, you find David Ortiz doppelgängers up to that point – until they were 31 – see how they did in the next three years, and make your prediction based on that. What you could do with opening a store is say: here’s where our stores are most effective; let’s find the city doppelgängers, basically. I think companies do that a little bit, but they’re limited by census data, which only gives very, very limited variables. With Google search data you have so many more variables, so you can get a much better predictor of what your successful cities have in common and which other cities or towns or blocks have similarities.

So, how do you control for searches where people are just expressing curiosity versus actual intent?

Yeah, I don’t know – that’s a really good question. At this point I don’t know. If you had individual-level data, you’d probably get a better sense of what’s just a curious search and what’s a real, meaningful search. When I was doing this research, I definitely made a lot of really, really nasty searches myself – and actually didn’t get a knock on the door from the FBI.
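The doppelgänger idea for store siting is essentially nearest-neighbour prediction: represent each city as a feature vector, find the most similar cities where you already operate, and average their outcomes. A minimal sketch, with entirely hypothetical city profiles and numbers:

```python
import math

def doppelganger_forecast(target, history, k=3):
    """Predict an outcome for `target` as the average outcome of its k
    nearest neighbours (Euclidean distance over the feature vectors)."""
    ranked = sorted(history, key=lambda rec: math.dist(target, rec["features"]))
    return sum(rec["outcome"] for rec in ranked[:k]) / k

# Hypothetical city profiles: (median age, search-interest score), with the
# outcome being annual revenue (in $M) of an existing store in that city.
history = [
    {"features": (34, 0.80), "outcome": 1.9},
    {"features": (35, 0.75), "outcome": 2.1},
    {"features": (33, 0.82), "outcome": 2.0},
    {"features": (52, 0.10), "outcome": 0.4},
    {"features": (50, 0.15), "outcome": 0.5},
]
# A candidate city that resembles the first three gets a forecast near
# their average, not the overall average.
print(doppelganger_forecast((34, 0.78), history, k=3))
```

With search data, the feature vector can have hundreds of dimensions instead of the handful census data allows, which is the advantage described above; in practice you would also normalize the features so no single one dominates the distance.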
It’s kind of interesting, but I made those particular racist searches because I wanted to know what came up and how we could analyze the data – so those would go into the data set. I think that’s a smaller percent of searches than people probably think, which is why the data correlates so strongly over and over again. It can handle a little bit of noise, a little bit of curiosity searches, a little bit of researchers or other factors, but the overwhelming reason people make a search is not that kind of curiosity or research – it’s because they actually are expressing some sort of attitude.

So, on Facebook it’s not people, it’s their representative, and when they’re on Google, it’s who they actually are. Have you seen any sort of difference in people’s intentions when they do searches by voice as opposed to typing?

I haven’t seen that. It would be interesting – you could imagine people would be less likely to express uncomfortable things by voice. There’s also this: I had just assumed that Google searches are always going to be the most honest data source, that people are always honest there. When I first gave this talk about five years ago, everyone just laughed, like, “Oh, of course, everyone’s so honest on Google.” Now some people don’t have that gut reaction of “I’m so honest on Google.” They’re like, “I don’t tell Google everything; I’m a little more cautious,” because of some of these things about the government knocking on your door. Someone actually did a study after Edward Snowden’s revelations where they looked at some of these sensitive searches, and they saw about a four percent drop in the kinds of searches people wouldn’t want other people to know they were making – searches for, you know, how to make a bomb, all these searches dropped about 4% after Edward Snowden’s revelations.
So I think people do respond a little bit to privacy and stuff. Actually, my favorite part of that study is that they had to figure out what counts as an embarrassing search. They used Amazon Mechanical Turk: they gave users a list of searches and asked them to rank how embarrassing each search was. Most of them were things like how to make a bomb, or embarrassing sexual stuff, or health conditions – anything you’d expect – but right up there at the top of the embarrassing searches, according to the Amazon Turk raters, was Nickelback, the band. And you actually do see, just as there was a drop for sexual stuff and health conditions, there was also a drop for Nickelback searches after Snowden’s revelations.

That’s not embarrassment, that’s just common sense. So, just one last question: what’s the most counterintuitive finding across your research?

I think just about everything. I talk in the book about how my intuition is just always wrong. The racism map was counterintuitive to me – I didn’t think that would be the map of racism. I’ve done a map of anxiety in the United States; I would have thought anxiety would be highest in New York City, among overeducated intellectuals. Not true: anxiety is highest in rural areas, in places with lower levels of education – higher in upstate New York than in New York City. Over and over again, you build a model of how the world works based on your intuition and your life experiences, and then you go to the data and it’s the total opposite.

Seth, thanks for your good work. Thanks for sharing it.