In Defence of Liberty

Driven by data; ridden with liberty.

Data Quality on Social Media

Data quality standards should be observed on social media. (Source: maxkabakov/iStockphoto)

After the wave of misleading and erroneous graphs following the 2016 elections, it is worth considering how we can ensure data quality and good standards of data visualisation on social media [1]. Misleading people is not a partisan concern, particularly when misinformation and misunderstanding can spread virally through a social network.

Poor data quality or misleading visualisations should be highlighted regardless of the story they purport to tell, or the political party or leader they support.

According to the ISO 8000 definition of data quality, there are six intrinsic characteristics of data quality: syntax, semantics, data sources, fitness, accuracy and completeness [2].

The first two relate more to data storage, which is less relevant to social media, but the last four are important when we share graphs.

Data sources and fitness

Data sources are the most important: where is your data coming from? The Office for National Statistics (ONS) is, as the name suggests, the official statistics bureau for the United Kingdom. The ONS has recently launched a new website, where data series can be searched [3].

Time series on the ONS website have a four-character code uniquely associated with them, called the time series ID. For instance, the employment rate has the code LF24.
Ideally, we should now cite – even if only in the footnotes – that four-character code.
Government departments also produce reports, and data can be drawn from those.

Private organisations also measure various aspects of our society and economy, producing publicly available reports, such as the CIPD reporting on zero-hours contracts [4]. The question for your source: is it respected?

Citing sources should be precise. Simply saying some data comes from the ONS is not enough: where can the reader find that data? Is the data searchable? Link to it, or tell them how to find it.

Fitness is whether the data is right for the purpose being asked of it. There are multiple ways to measure unemployment, such as the level (the number of people aged 16-64 who are economically active but out of work), or the rate (the people aged 16-64 who are economically active but out of work, expressed as a percentage of those who are economically active). Which measure is better depends on the question.
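
The difference between the two measures can be made concrete with a minimal sketch. The figures are invented for illustration and are not ONS data, and the function name `unemployment_rate` is hypothetical. It assumes the economically active population is the employed plus the unemployed.

```python
def unemployment_rate(unemployed: float, employed: float) -> float:
    """Return unemployment as a percentage of the economically active.

    Assumes the economically active are the employed plus the unemployed.
    """
    economically_active = employed + unemployed
    if economically_active <= 0:
        raise ValueError("economically active population must be positive")
    return 100.0 * unemployed / economically_active


# Invented figures (in thousands) for two hypothetical periods.
earlier = {"unemployed": 1_500, "employed": 28_500}
later = {"unemployed": 1_600, "employed": 30_400}

# The level is the raw count; the rate is relative to the economically active.
level_change = later["unemployed"] - earlier["unemployed"]
rate_earlier = unemployment_rate(**earlier)
rate_later = unemployment_rate(**later)

print(f"Change in level: {level_change} thousand")
print(f"Rate: {rate_earlier:.1f}% then {rate_later:.1f}%")
```

In this invented example the level rises by 100 thousand while the rate stays at 5%, because the economically active population has grown as well. A graph of the level alone would suggest worsening unemployment; a graph of the rate would not.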

Accuracy and completeness

Accuracy is another key part of data quality. Typographical errors made when copying data over are probably quite rare. More often, accuracy problems arise when people do not recognise the difficulties with certain measures, or ignore caveats. Surveys, even large surveys, will have some uncertainty – some margin of error – associated with their estimates.
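
As a rough illustration of that uncertainty, the sketch below computes an approximate 95% margin of error for a survey proportion. It assumes a simple random sample and the normal approximation; real surveys such as the Labour Force Survey use more complex designs and weighting, so their published uncertainty will differ. The function name `margin_of_error` and all figures are invented for illustration.

```python
import math


def margin_of_error(proportion: float, sample_size: int, z: float = 1.96) -> float:
    """Approximate margin of error for a proportion (simple random sample).

    z = 1.96 corresponds to a 95% confidence level under the normal
    approximation.
    """
    if not 0.0 <= proportion <= 1.0:
        raise ValueError("proportion must be between 0 and 1")
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    standard_error = math.sqrt(proportion * (1.0 - proportion) / sample_size)
    return z * standard_error


# Invented survey result: 2.5% of 40,000 respondents report some characteristic.
estimate = 0.025
moe = margin_of_error(estimate, 40_000)
lower, upper = estimate - moe, estimate + moe

print(f"Estimate: {estimate:.2%}, interval: {lower:.2%} to {upper:.2%}")
```

Even with 40,000 respondents, the estimate carries a margin of roughly 0.15 percentage points either side, so small differences between two such estimates may not be meaningful.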

An example of this accuracy problem comes from Dr Eoin Clarke, who regularly posts charts on Twitter, sometimes under the guise of Labour Left [5]. The ONS data for zero-hours contracts, which is derived from their Labour Force Survey, comes with the following caveat [6]:

Comparisons with 2012 and earlier years are complicated by a large increase between 2012 and 2013, which appeared to be mainly due to increased recognition of “zero-hours contracts”.

Dr Clarke instead chose to write on the graph:

The ONS says this data has a coefficient of variation of +/- 5% & can be considered accurate.

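This is clearly inaccurate: a coefficient of variation describes sampling variability, and says nothing about the break in comparability that the ONS caveat warns of.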

Completeness asks whether the data set is full, or whether some data points are missing. The absence of some data points is not a reason to avoid using an incomplete data set entirely, but that absence should be highlighted to the reader. This is also entwined with data visualisation, since cutting out parts of a time series can give a misleading impression.
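
Checking for gaps before plotting can be done mechanically. The sketch below assumes a monthly series and lists the months absent between the earliest and latest observation; the function name `missing_months` and the dates are invented for illustration.

```python
from datetime import date


def missing_months(observed: list[date]) -> list[date]:
    """Return first-of-month dates absent between the earliest and latest observation."""
    if not observed:
        return []
    present = {(d.year, d.month) for d in observed}
    start, end = min(observed), max(observed)
    gaps = []
    year, month = start.year, start.month
    # Walk month by month from the first observation to the last.
    while (year, month) <= (end.year, end.month):
        if (year, month) not in present:
            gaps.append(date(year, month, 1))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return gaps


# Invented monthly series with two months absent.
series = [date(2015, 10, 1), date(2015, 11, 1), date(2016, 1, 1), date(2016, 3, 1)]
gaps = missing_months(series)

print("Missing:", ", ".join(d.strftime("%b %Y") for d in gaps))
```

Here December 2015 and February 2016 are missing, and a note on the graph should say so rather than letting a line join the surrounding points silently.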

Suggested guidelines

For social media, we should follow these six guidelines on data quality:

  • Precise citations of the data source (such as time series ID for ONS data), linking to the source itself, where possible;
  • Accurate descriptions of what is being shown;
  • The data itself should be accurate, for which the publisher takes responsibility;
  • If the data source has provided a caveat on their data, ensure that this is also present on the graph;
  • Highlight any missing data points, where this is not immediately clear from the graph itself;
  • The producer of a graph should be known, such as the Twitter username being placed on the graph itself.

This last standard seeks to offer accountability, so people can ask questions about the graph.

Misleading graphs can spread like wildfire: data quality standards for social media offer the firebreak.

References

[1] Waterson, J., 2016. That Election Map Everyone Is Saving Is Fake. BuzzFeed. Available from: https://www.buzzfeed.com/jimwaterson/the-viral-graphic-is-halfway-around-the-world [Accessed: 13th May 2016]

[2] Benson, P., 2008. ISO 8000 the International Standard for Data Quality. MIT. Available from: http://mitiq.mit.edu/IQIS/Documents/CDOIQS_200877/Papers/13_01_5A-1.pdf [Accessed: 13th May 2016]

[3] ONS, 2016. Time series tool. Available from: https://www.ons.gov.uk/timeseriestool [Accessed: 13th May 2016]

[4] CIPD, 2013. Zero-hours contracts: myths and reality. Available from: http://www.cipd.co.uk/hr-resources/research/zero-hours-contracts-myth-reality.aspx [Accessed: 13th May 2016]

[5] Masters, A., 2016. Again. In Defence of Liberty. Available from: https://anthonymasters.wordpress.com/2016/03/24/again/ [Accessed: 13th May 2016]

[6] ONS, 2016. Contracts that do not guarantee a minimum number of hours: March 2016. Available from: https://www.ons.gov.uk/employmentandlabourmarket/peopleinwork/earningsandworkinghours/articles/contractsthatdonotguaranteeaminimumnumberofhours/march2016 [Accessed: 13th May 2016]

This entry was posted on May 26, 2016 in Social Media.