/ Clean

Data Scientist == Data Janitor

Each variable is a column, each observation is a row.

A tremendous part of the work of a data scientist will always be the process of cleaning and preparing data. Mostly that data comes from various sources, is unstructured, contains missing fields and confusing column labels. In short it sure isn't as nicely prepared as a Kaggle competition dataset.

Hadley Wickham's "Tidy Data" whitepaper gives you a good basis and guidelines on how to structure your data. The much cited "Each variable is a column, each observation is a row" quote originated in this whitepaper:

Tidy Data by Hadley Wickham

Additionally a good read is the Dirty Data Article by Kdnuggets.

Data Scientist == Data Janitor
Share this

Subscribe to All things Data Science