By Elise Huard.

Big data seems more and more to be the "next big thing" (the speaker said half-jokingly "the next cloud"), which is quite strange, as data has been around for a long time, and jobs around it (data mining) were always considered pretty boring - it seems it's fashionable now. The reason is probably that lots of data are now available from various sources, and that no small part of it is even publicly available.

The question asked was: how does this affect us as developers? How could we use this inside our own existing applications?

  • Visualisations: making graphs with Google's charting tools or others, used for example on the Guardian website (I'm personally a fan of "Cool Infographics", for both content and look)
  • Lots of data with lots of dimensions, which our brains are not efficient at crunching, hence the need for statistics to help extract useful information
  • One angle on this is exploratory data analysis, using tools such as R to crunch the data
  • What do you want to find? Typically: correlations, trends, clustering (a small R sketch follows this list)
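
To make the exploratory data analysis point concrete, here's a minimal sketch of what crunching a dataset in R might look like. Everything here is illustrative, not from the talk: the file name and columns ("sales.csv", review_score, units_sold, price) are invented.

```r
# Minimal exploratory sketch in R (hypothetical "sales.csv" with made-up columns).
sales <- read.csv("sales.csv")

# Quick overview: distributions, ranges, obvious outliers.
summary(sales)

# Correlation: does review quality relate to units sold?
cor(sales$review_score, sales$units_sold)

# Trend: fit a simple linear model and inspect the coefficients.
fit <- lm(units_sold ~ review_score + price, data = sales)
summary(fit)

# Clustering: group products along a few numeric dimensions.
clusters <- kmeans(scale(sales[, c("price", "review_score", "units_sold")]), centers = 3)
table(clusters$cluster)
```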

A nice example of a correlation found via data crunching was the link between badly written reviews and low sales at Amazon (something that was probably not anticipated: while logical, it's probably not in the top 5 reasons you would expect for poor sales).

The second part of the talk was more about methodology, with some nice, if well-known, points:

  • Our instincts are wrong (especially ours as developers), so get out of the building, go talk to your customers or prospects to get information/feedback
  • Make an assumption, choose a metric, define a minimal viable product: typically a launch page with an email collector. This is something I've now heard three or four times, from different people, and that I believe in more and more: the "minimal viable product" can be pretty minimalist. If many people show interest, well, time to build it for real (or at least a first, very limited version). You can then measure the chosen metric to validate the assumption, giving you the "fail fast pivot", i.e. quick feedback allowing you to "pivot" (change strategy/product/anything) if it's negative (see the sketch after this list).
  • Double loop learning: the standard loop is something like plan -> check -> adjust. With so much information and so many possibilities outside, it becomes something like assume -> plan -> check -> adjust
  • When you're collecting data, you should get it back to the developers, as it can be a good way to engage them, make them "feel" their work/impact (positive or negative). This is consistent with the "everyone on the front line" approach (used for example by 37signals): if your developers are never confronted with the results of their work, they will not be engaged.
  • Lastly, be careful: metrics are not the be-all and end-all. One of the risks is that data usage leads to small steps, small optimizations that guide you to a (very) local optimum, whereas you sometimes need bigger steps. Again, use the cycle: assume, check.
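
As a rough illustration of the "choose a metric, validate the assumption" step above, here is a small R sketch checking whether a new landing page's signup rate actually beats the current one. A two-proportion test is one simple way to tell real improvement from noise; the variant names and counts are entirely invented.

```r
# Hypothetical numbers: validating a signup-rate assumption for a landing page.
# Variant A is the current page, variant B the minimal viable product pitch.
visitors <- c(A = 1200, B = 1180)   # made-up traffic counts
signups  <- c(A = 36,   B = 71)     # made-up email signups

# Observed conversion rates for each variant.
signups / visitors

# Two-proportion test: is B's signup rate genuinely higher, or just noise?
# A small p-value supports the assumption; otherwise, time to pivot.
prop.test(signups, visitors)
```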

An interesting talk, but more an introduction to a reflection you should pursue by yourself. I would have liked more on the "data" subject, perhaps some tooling, or detailed use cases or experiences. I do think the topic is hot (as shown for example by the O'Reilly Strata conference and its weekly posts), and could probably use a more thorough treatment.