
Hadley Wickham defines Tidy Data

For non-specialists: If you’ve ever wondered why it’s so hard to get all the data points you want into a simple, workable database and what you can do about it, this is a good place to start. Tidy data is a way of structuring data so that it’s easier for machines to read and reshape. The paper provides a methodology for tidying data. A tidy structure also removes the ambiguity between zero values and missing values, and reduces the number of empty cells in tables. Recommended.

It is often said that 80% of data analysis is spent on cleaning and preparing data. Data preparation is not just a first step; it must be repeated many times over the course of analysis as new problems come to light or new data is collected. To get a handle on the problem, this paper focuses on a small but important aspect of data cleaning that I call data tidying: structuring datasets to facilitate analysis.

Tidy data

Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is
messy or tidy depending on how rows, columns and tables are matched up with observations,
variables and types. In tidy data:

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.

This is Codd’s 3rd normal form (Codd 1990), but with the constraints framed in statistical
language, and the focus put on a single dataset rather than the many connected datasets
common in relational databases. Messy data is any other arrangement of the data.
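
To make the three rules concrete, here is a minimal sketch in plain Python (not code from the paper). It reshapes a small "messy" table, with one column per treatment, into tidy form, with one row per observation. The table contents, the column names and the helper function tidy() are all made up for illustration. The sketch also assumes that a missing measurement is stored as None and should simply produce no row, while a measured zero is kept.

```python
# Hypothetical "messy" table: one row per person, one column per treatment.
# Names and numbers are invented for illustration only.
messy = [
    {"person": "Ann", "treatment_a": None, "treatment_b": 2},
    {"person": "Bob", "treatment_a": 16, "treatment_b": 0},
]


def tidy(rows, id_column, variable_name, value_name):
    """Reshape wide rows into one row per (id, variable) observation."""
    tidy_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue  # the identifier is repeated on each output row
            if value is None:
                continue  # assumption: None means "not measured", so no row
            tidy_rows.append(
                {
                    id_column: row[id_column],
                    variable_name: column,  # old column header becomes a value
                    value_name: value,
                }
            )
    return tidy_rows


tidy_rows = tidy(messy, "person", "treatment", "result")
for r in tidy_rows:
    print(r)
```

In the reshaped table every column is a variable (person, treatment, result) and every row is one observation. Bob's result of 0 survives as an explicit row, whereas Ann's unmeasured treatment leaves no empty cell behind.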

Hadley Wickham – Tidy Data, Journal of Statistical Software MMMMMM YYYY, Volume VV, Issue II

I'd love to hear your thoughts and recommended resources...