Paul: I think maybe a great starting point would be if you could do a quick introduction to yourself, and maybe tell us a little bit about your background.
Gregory: Sure. I’m Gregory Piatetsky-Shapiro. I’m currently the Editor and President of KDnuggets. I started as a researcher in machine learning and databases, and got my PhD in 1984. I joined GTE Labs in 1985. GTE was a large telephone company in the US (now part of Verizon), and I was working there on the application of intelligent methods to databases. I organized the first workshop on knowledge discovery in databases in 1989, and these workshops grew into the first conference in this area in 1995 (KDD-95) and now into a leading research conference in the field (www.kdd.org), with the next conference, KDD-2013, in Chicago. After GTE Labs I worked at a couple of startups, and since 2001 I’ve been an independent consultant in data mining and analytics.
I started publishing KDnuggets in 1993 as a newsletter to connect researchers who attended the workshops on knowledge discovery; there were about 50 subscribers. I added a website in 1994 and created the KDnuggets.com website in 1997. KDnuggets is now an influential voice covering the Analytics, Big Data, Data Mining, and Data Science space. The KDnuggets audience has been growing and reached over 55,000 unique monthly visitors in the last 3 months, and over 25,000 subscribers via email, Twitter, Facebook, and other social networks.
Paul: How would you describe big data, and what does it actually mean for businesses, and particularly startups?
Gregory: The definition of Big Data I like best is “Data is Big when data size becomes part of the problem”. In addition to size or volume of data, there are two other important parameters, usually mentioned together as the 3Vs of Big Data (first defined by Doug Laney).
- Volume: think Google-size.
- Velocity: think Twitter.
- Variety: think text, images, links, videos, etc.
Those three Vs of big data make analyzing it more difficult, more challenging, and potentially more rewarding than a few years ago.
Google Trends for Big Data shows explosive growth in the popularity of this term, starting around 2011. “Big Data” is also the name of many conferences and of a Big Data journal that started recently. All these meetings and publications try to cover this important trend (more trends and analysis in my presentation Analytics Industry Overview); there is far more data everywhere now than at any time before. It’s also a tremendous opportunity for many startups. I think Big Data is an opportunity no less significant than the second Industrial Revolution.
Paul: An incredibly powerful thing to say. What sort of opportunities do you think startups can make of that?
Gregory: I think the biggest opportunities are in new platforms. Think of Google and Facebook: they would not exist without big data. If the number of searches or web pages were very small, there would be no need for Google. If the number of connections between people were small, there would be no need for Facebook. Google and Facebook are two of the biggest companies, but you can think of local information too, perhaps FourSquare. You can think of recommendations and social media. Many combinations of mobile, social, and analytics would create very interesting startups. At the same time, Big Data is also generating a lot of hype – see my blog in HBR on Big Data Hype (and Reality).
Paul: Old businesses, I suppose, can use their own data to become better at their internal processes and become better operationally, also. Do you have any insights on what’s the right balance when you’re talking about using intuition versus data and experience? Where does data fit into that?
Gregory: I think intuition is great if your problem is small or medium-sized. If you burn your hand once, you know not to touch the hot stove again; you don’t need big data for that. But when the problem and the data are big, we usually don’t have the relevant intuition. Again, think of Google; there is no intuition for what would be the right page to answer every query. Human intuition just doesn’t scale for such large problems. When there are lots of examples, lots of instances, lots of connections, big data can help find the right processes. Big Data can be quite effective at the individual level, for example in targeting ads or determining which people are likely to buy which product or switch telephone companies. However, big data does not help when the prediction concerns not individuals but a large-scale system, such as an economy or a political event. We only have one Earth, and we cannot run millions of simulations in parallel universes to create predictive models. On this Earth, we’ve only had a limited number of big financial crises, and so it is very hard to predict events like the financial crisis of 2008.
Paul: What would you say, in your experience, have been the best mechanisms or process that organizations can adopt to make sure that the data that they’re using is acted upon and not just mounting up in some data warehouse somewhere, or hidden under an analyst’s To-Do list? How can you make sure that insights are gained from that data and that an organization can actually take action based on it?
Gregory: I think you start with the right questions first. You really need to focus on the business goals, integrate them into the entire process, and iterate quickly. There is no point in building a very large process just for the sake of discovering something interesting if it’s not actionable. Focus on what is actionable, iterate, build prototypes, and try to integrate data from different sources as much as possible. Also, realize the limitations. Not everything is predictable. Big data can improve the predictions, but it is not likely to make them perfect.
The Netflix prize is a good example (details in my blog in Harvard Business Review). Netflix users rate movies from 1 to 5 stars, and the Netflix algorithm gives recommendations and an estimated rating. When Netflix announced its $1 million prize in 2007, the average error between the estimated movie rating and the actual user rating was almost a full star (actually 0.95 stars). The goal of the Netflix prize was to reduce that error by 10%, to 0.86 stars. The two interesting things that emerged from the competition were:
- In just two weeks, several teams exceeded the accuracy of the Netflix algorithm (but only by a little).
- It took 3 years for the best teams to reach the goal of a 10% improvement. People are predictable, but not completely.
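[Editor's note: the arithmetic behind the Netflix prize target can be sketched in a few lines of Python. This is only an illustration, not Netflix's actual code. It assumes the error is measured as root-mean-square error (RMSE), the metric used in the competition; the function names `rmse` and `target_error` and the four toy ratings are made up for the example.]

```python
import math


def rmse(predicted, actual):
    """Root-mean-square error between predicted and actual star ratings."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def target_error(baseline, improvement=0.10):
    """Error that a given fractional improvement over the baseline implies."""
    return baseline * (1.0 - improvement)


# Hypothetical toy data: four predicted ratings and the ratings users gave.
predicted = [3.5, 4.0, 2.0, 5.0]
actual = [4.0, 3.0, 2.0, 4.0]
print(f"RMSE on toy data: {rmse(predicted, actual):.2f} stars")

# Figures from the interview: baseline error of 0.95 stars, 10% reduction.
baseline = 0.95
goal = target_error(baseline)
print(f"Baseline {baseline:.2f} stars, 10% better is about {goal:.3f} stars")
```

[A 10% reduction from 0.95 stars gives roughly 0.855 stars, which matches the 0.86 figure quoted above after rounding. The gap between the two numbers is less than a tenth of a star, which helps explain why closing it took years.]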
Paul: Is there anything else you’d like to speak about, or is there anything in particular that’s exciting you at the moment?
Gregory: Big Data and Analytics is a very interesting area. It’s exciting to see tremendous growth in companies, start-ups, research, and applications in this area. Data scientists are very much in demand and have significant power in helping companies and society make important decisions. However, with power comes responsibility. Should data scientists adhere to a code of conduct? What should it contain? KDnuggets will be organizing a debate on this topic on Google Hangout April 10, 2013 at 1pm PT / 4pm ET / 19:00 BST – the exact URL will be announced on KDnuggets about 30 minutes before.
Paul: Thank you very much for taking the time today.
Gregory: Thank you very much, Paul. It was my pleasure.