Do we really need standards, even in Data Science? That is the first question that comes up whenever the topic is raised. Why should we apply standards like IBCS? Why would that be necessary when you and the developer both understand the report? That is all it should do, right? Well, maybe, when you are the only one using the report. But even then: once you leave the company or your position, it takes others a lot of time to understand the reports created on your watch.
So how can we make sure that reports, analyses and even data mining results are easier to understand without your explanation (and your time) once you have moved on to a new position? This is where some form of standard comes into play. A well-known open standard is IBCS (International Business Communication Standards). It lays down rules for displaying data in reports, so that viewers do not draw the wrong conclusions from what they see. BUT those standards do not apply to reporting alone. In its course on Data Science in R, Harvard applies part of the IBCS standard as well. They do not use the term IBCS (yet) and merely talk about “visualization principles”, but we are surely talking about the same thing…
As more and more companies invest time (and money) in Data Science, and knowledge of R and/or Python keeps growing, this is the perfect time to see what visualization standards can do for Data Science. “Simply” mining the data and then convincing the viewer that your conclusion is correct is not the way to go. At least, that is not what I believe in. As an analyst, data scientist or business intelligence consultant, you need to take viewers by the hand and show them what you found, without having to explain how to read the report.
And this is where one of the IBCS standards, the “same axes view”, comes into play. Graphs built according to IBCS standards are self-explanatory, sometimes accompanied by a short message that captures the essence. Let’s dive into 2 examples (there are many more; if you are interested, just get in touch) of how the IBCS standards can be applied in Data Science and what that does for you as a viewer. One of the IBCS standards concerns consistent scaling of the axes. In my opinion this one is rather important, because inconsistent axes can confuse you like crazy. First, let’s look at the difference between including and excluding 0 on the axis in reporting:
Above on the left is a screenshot from Fox News showing the southwest border apprehensions. If you compare the bars for 2011 and 2013, it looks as if there was a huge increase (about 3x as much). When the axis starts at 0, the plot looks like the one on the right, which shows the actual difference of ~16%. That plot tells you immediately that the difference isn’t actually that big. Cutting off the axis is an amazing way to sell something, but it is not so great when you need insight into your actual data and its increases or decreases, that is for sure.
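To put a number on that visual effect, here is a minimal sketch in Python. It is my own illustration, not something taken from the chart or from IBCS: the function name `visual_ratio` and the values 100, 116 and 92 are hypothetical, picked only so that a real difference of 16% ends up drawn as a bar three times as tall.

```python
def visual_ratio(earlier, later, axis_start=0.0):
    """Return how many times taller the 'later' bar is drawn than the
    'earlier' bar when the value axis starts at axis_start."""
    if axis_start >= min(earlier, later):
        raise ValueError("axis_start must lie below both values")
    return (later - axis_start) / (earlier - axis_start)


# Hypothetical values: an actual increase of 16%.
earlier, later = 100.0, 116.0

# Axis cut off at 92 (assumed): the drawn bars are 8 and 24 units tall.
truncated = visual_ratio(earlier, later, axis_start=92.0)

# Axis starting at 0: the drawn bars match the actual values.
honest = visual_ratio(earlier, later, axis_start=0.0)

print(f"axis starts at 92: later bar looks {truncated:.2f}x as tall")
print(f"axis starts at 0:  later bar looks {honest:.2f}x as tall")
```

The closer the start of the axis gets to the smallest value, the more the drawn ratio blows up, while the data itself does not change at all.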
* For quick analyses in data mining and Data Science this would be a little much, but I do want to mention it briefly: when you only want to see the amounts and the difference across the years, a graph like the one shown above is sufficient. There is more to get out of a simple graph like that, though, by using the Graphomate add-on in reporting tools such as SAP Web Intelligence, Design Studio or Lumira Designer. By adding variances, the differences can be made even clearer, as shown in the illustration on the right. The difference per year, compared to the previous year, is then shown in green (for a positive change) or in red (for a negative change). Adding this to reports increases readability, and the viewer immediately sees the changes over the years.
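The variance idea itself does not depend on any particular tool. As a rough sketch in Python (my own illustration, not the Graphomate API; the function name, the yearly values and the "neutral" colour for no change are all assumptions), the year-over-year variances and their colours could be derived like this:

```python
def year_over_year_variances(values_by_year):
    """Return (year, delta, percent, colour) for each year compared with
    the previous year in the data. Green marks a positive change and red
    a negative one; 'neutral' for no change is an assumption."""
    years = sorted(values_by_year)
    rows = []
    for previous, current in zip(years, years[1:]):
        base = values_by_year[previous]
        delta = values_by_year[current] - base
        # A percentage is undefined when the previous value is 0.
        percent = delta / base * 100 if base != 0 else None
        if delta > 0:
            colour = "green"
        elif delta < 0:
            colour = "red"
        else:
            colour = "neutral"
        rows.append((current, delta, percent, colour))
    return rows


# Hypothetical yearly amounts, for illustration only.
amounts = {2011: 100.0, 2012: 90.0, 2013: 116.0}
rows = year_over_year_variances(amounts)

for year, delta, percent, colour in rows:
    print(f"{year}: {delta:+.1f} ({percent:+.1f}%) -> {colour}")
```

Whether green should really mean "up" depends on the measure: for costs, an increase is usually the bad news, so the colour mapping is worth agreeing on per report.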
Another aspect of the IBCS standards (included here) is the “common axes”. In my opinion this is the other most important aspect to take into account, especially in Data Science. When just looking at graph 1 below, it is not immediately clear whether females or males are taller on average. But once you take a closer look at the axes, you see that they are not the same. (The MECE rule of IBCS is used here, as there is no need to display all the empty columns before 50, but for comparison both graphs should use the same axes and both run from 50 to ~90.) Imagine comparing not heights but the costs of a business unit: a graph like the one below would make you assume the distributions are equal, whereas with the same axes you can see that the distribution on the right (graph 2) is higher.
When looking at the comparison in graph 1, it is not even very clear what the difference in distribution is. This is because the graphs are placed side by side horizontally. When looking at the graphs in number 2, the conclusion is much easier to draw: your eye immediately sees that the distribution for males is higher than that for females. This is another standard I think we should apply when comparing graphs. In a way, it can be seen as an open interpretation of how the IBCS standards treat the axes and the positioning of graphs.
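The common-axes idea can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function names, the bin width of 5 and the height values are hypothetical and are not the data behind the graphs. The point is that the axis limits are derived once from all groups together and then reused for every panel, with the panels printed underneath one another.

```python
import math


def shared_axis_limits(*groups, step=5):
    """Return one (low, high) pair covering every group, rounded
    outwards to a multiple of step."""
    low = min(min(group) for group in groups)
    high = max(max(group) for group in groups)
    return math.floor(low / step) * step, math.ceil(high / step) * step


def histogram(values, low, high, step=5):
    """Count values per bin of width step on the axis from low to high."""
    counts = [0] * ((high - low) // step)
    for value in values:
        # Clamp so that a value equal to high falls in the last bin.
        index = min(int((value - low) // step), len(counts) - 1)
        counts[index] += 1
    return counts


# Hypothetical heights, for illustration only.
female = [59, 62, 63, 64, 65, 66, 68]
male = [64, 67, 68, 69, 70, 72, 76]

low, high = shared_axis_limits(female, male)  # (55, 80)
female_counts = histogram(female, low, high)  # [1, 3, 3, 0, 0]
male_counts = histogram(male, low, high)      # [0, 1, 3, 2, 1]

# Stack the panels vertically on the same axis, one row per category.
print(f"axis {low} to {high}, bin width 5")
for label, counts in (("female", female_counts), ("male", male_counts)):
    print(f"{label:>6}: " + " ".join("#" * c or "." for c in counts))
```

Because both rows use the same bins, the shift to the right for the second group is visible at a glance. Plotting libraries usually offer this as an option; in matplotlib, for example, `plt.subplots(2, 1, sharex=True)` stacks two panels on one shared horizontal axis.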
In the IBCS standard it is stated that horizontal graphs “need to show a difference over time” and that vertical graphs are used when comparing articles, business units, countries etc. In this case you could see “male” and “female” as a category (not the graph with its dual axes and the information inside, but the gender category as a whole), which would mean you align them vertically, and therefore underneath one another. This is how I think we could apply this rule of representing data to Data Science when comparing multiple graphs for different categories. On the other hand, IBCS was created for reports with measures on 1 axis and multiple categories or time periods, so applying it to Data Science (and with that to 2 axes or more) is a new point of view. Still, I think we could apply this way of representing and comparing graphs, whether they are shown horizontally (time) or vertically (category), to boxplots, bar charts, column charts etc. as well.
For other aspects of Data Science (e.g. k-means or decision trees) there is still room for discussion when it comes to standards. If you have ideas about improvements, or about aspects to include or exclude, please get in touch and let’s start the discussion.
P.S. If I have sparked your interest in IBCS, please do not hesitate to contact JUGO: info@JUGO.nl