Part 2. The FLOAT Method
2.4 Organize
Organize and Clean Data
Tips for Data Management:
- Save files in open formats: Saving files in open formats, such as PDF, TXT, CSV, and RTF, will allow more people to open your work. It also keeps files from becoming inaccessible in the future, even to you (think of the floppy disk problem).
- Keep your computer organized: Adopting a file naming convention will increase your chance of finding the correct file. When naming a file, make sure the name is identifiable without relying on its containing folder. A recommended pattern is the project, then the subproject, then the file name, followed by the version. To version a file, save each revision as a new file with the same name followed by a new version number or an ISO-format date (or date and time). That way, you have old versions of the file to refer to if there is an error or if you need to return to a previous version.
- Describe the project: Creating a metadata README file and codebook will help you document your process for later reference. It helps you remember the decisions you made when you return to your work in the future. It also helps others better understand your methods, which increases the chance that someone will be able to repeat your work. Remember to also keep track of the software you used, including version numbers, to avoid version incompatibility in the future.
- Deposit your data: Even if you are unsure whether others will use your dataset, depositing your data in a trusted repository, such as Mavs Dataverse, will allow you to share your data and link it to any related research articles and publications. Because you and others can then access quality versions of your data, depositing can increase the number of citations and help others reproduce your findings.
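The file naming tip above can be sketched in a few lines of Python. This is only an illustration: the function name `build_file_name`, the underscore separator, and the two-digit `v01` style are assumptions made for this example, not a required standard.

```python
from datetime import date


def build_file_name(project, subproject, name, version, extension):
    """Build a file name as project_subproject_name_version.extension."""
    parts = [project, subproject, name]
    # Lowercase each part and replace spaces so names sort and match predictably.
    slug = "_".join(p.strip().lower().replace(" ", "-") for p in parts)
    # The version is either a number (v01, v02, ...) or an ISO-format date.
    if isinstance(version, date):
        suffix = version.isoformat()
    else:
        suffix = f"v{int(version):02d}"
    return f"{slug}_{suffix}.{extension}"


print(build_file_name("float", "survey", "responses", 3, "csv"))
# float_survey_responses_v03.csv
print(build_file_name("float", "survey", "responses", date(2024, 5, 17), "csv"))
# float_survey_responses_2024-05-17.csv
```

Because the project and subproject are part of the name itself, the file stays identifiable even if it is moved out of its folder.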
Organize
No matter the type of data (numbers, images, or otherwise), data quality is important. Inaccurate or improperly structured data can distort results. Data management is a way to organize your data and maintain its integrity. It involves practices such as creating file naming standards for consistency and easier location; carefully recording all the details about the data (i.e., the metadata), including decisions made during the project; and saving copies of your data in more than one place to prevent loss.
In addition, data management includes creating and consistently following quality control procedures to prevent errors in data entry and data use, as well as making plans to properly store your data after the project is complete. It is crucial to ensure that your data is correct, consistent, and usable by regularly describing the data, taking steps to prevent data errors and corruption, and identifying and correcting any errors that occur.
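As one possible illustration of a quality control procedure, the Python sketch below scans entered rows and lists the problems it finds so that a person can review them. The function name `find_entry_problems`, the field names, and the age range of 0 to 120 are all hypothetical choices for this example.

```python
def find_entry_problems(rows, required, numeric_ranges):
    """Return a list of (row_number, field, problem) tuples for review."""
    problems = []
    for index, row in enumerate(rows, start=1):
        # Check that every required field has a non-blank value.
        for field in required:
            if not str(row.get(field, "")).strip():
                problems.append((index, field, "missing value"))
        # Check that numeric fields parse and fall inside the allowed range.
        for field, (low, high) in numeric_ranges.items():
            raw = str(row.get(field, "")).strip()
            if not raw:
                continue  # blank values are handled by the required check
            try:
                value = float(raw)
            except ValueError:
                problems.append((index, field, "not a number"))
                continue
            if not low <= value <= high:
                problems.append((index, field, "out of range"))
    return problems


# Invented sample rows, as they might be typed into a spreadsheet.
rows = [
    {"participant": "P01", "age": "34"},
    {"participant": "", "age": "29"},
    {"participant": "P03", "age": "290"},
    {"participant": "P04", "age": "forty"},
]
problems = find_entry_problems(
    rows, required=["participant", "age"], numeric_ranges={"age": (0, 120)}
)
for problem in problems:
    print(problem)
```

The sketch only reports problems; deciding how to correct each one is left to the researcher and should be recorded in the project documentation.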
Data Management
Data management continues throughout the research lifecycle. Data preparation, by contrast, is the part of data hygiene that typically happens before data exploration and analysis. It draws on data management practices and includes cleaning and, if needed, restructuring the data. Even though massive amounts of data are produced daily, much of that information is not accessible or not available in ready-to-use formats. More than 80% of all data generated today is considered unstructured, meaning data with no pre-defined format or organization, which makes it much more difficult to collect, process, and analyze.
Structured data, on the other hand, consists of clearly defined data types whose pattern makes them easily searchable. When data is prepared for analysis, it is usually organized into a dataset: a collection of related information arranged in rows and columns. Organized data is composed of separate elements but can be manipulated as a unit by a computer. Therefore, to create effective visualizations, we must first organize and analyze the data to derive insights from it.
Data Cleaning
Data cleaning is the process of standardizing (or re-standardizing) data for consistency. Also referred to as “data cleansing,” data cleaning usually involves correcting inaccuracies or errors in a body of collected information. For example, if a dataset has a “City” variable and New York City, NY is entered as NYC, New York, NewYork, and New York City, it is difficult to count how many times that city appears in the dataset. The cognitive part of the process (i.e., anticipating what cleaning needs to be done) and the manual part are what can make data cleaning an overwhelming task.
While much of data cleaning can be done by software, it must be monitored and inconsistencies reviewed. This is why building a protocol for data cleaning is imperative.
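A minimal Python sketch of the “City” example follows. The lookup table `CITY_VARIANTS` and the function `clean_city` are hypothetical names, and the sketch assumes that "New York" always means the city in this dataset, which a real protocol would need to confirm. Values that are not in the table are left unchanged and flagged for a person to review rather than guessed at.

```python
from collections import Counter

# Known spellings mapped to one standard name (lowercase keys).
CITY_VARIANTS = {
    "nyc": "New York City",
    "new york": "New York City",
    "newyork": "New York City",
    "new york city": "New York City",
}


def clean_city(raw):
    """Return (standard_name, was_recognized) for one raw entry."""
    # Normalize case and collapse extra spaces before looking up the value.
    key = " ".join(raw.strip().lower().split())
    if key in CITY_VARIANTS:
        return CITY_VARIANTS[key], True
    return raw.strip(), False


entries = ["NYC", "New York", "NewYork", "New York City", " new york ", "Nw York"]
cleaned = [clean_city(entry) for entry in entries]

# Count each city after standardizing, and list entries needing human review.
counts = Counter(name for name, _ in cleaned)
needs_review = [name for name, recognized in cleaned if not recognized]

print(counts["New York City"])  # 5
print(needs_review)             # ['Nw York']
```

Keeping the lookup table in a file alongside the data is one way to document the cleaning decisions as part of the protocol.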
Data Structuring or Restructuring
The research question that you formulated for the project will assist you in organizing the data. That question will guide you toward the most essential elements of the data you are seeking to assemble and use. According to Hadley Wickham, “Data preparation is not just a first step, but must be repeated over the course of analysis as new problems come to light or new data is collected.” Wickham’s “Tidy Data” (2014) offers useful and straightforward principles for organizing data.
Wickham defines “tidy data” as “a standard way of mapping the meaning of a dataset to its structure.” In practice, tidy data organizes a dataset into a consistent format that most data tools can work with.
One of the pitfalls of working with Excel is that data can be entered in an unstructured manner which will hinder data analysis. We organize data in spreadsheets in the ways that we as humans want to view the data, but computers require that data be organized in particular ways. In order to use tools that make computation more efficient, we need to structure our data the way that computers need the data.
Tidy format is not the only way to store data, and there are reasons why you might not store data this way. Eventually, however, you will probably need to convert your data to a tidy format in order to analyze it efficiently.
Principles of Tidy Data
There are three rules which make a dataset tidy:
- Each variable must have its own column. A variable is a measurement type or category of values. This could be anything you are studying. Some examples include years, ages, percentages, names, temperatures, publication types, genders, and other characteristics. In a tidy dataset, all of the variables are listed across the top of the table in the first row, called the header row.
- Each observation must have its own row. An observation is a set of measurements made on one particular unit or case. Some examples include the responses of one participant in a survey, one publication’s metadata, or the characteristics of a common contagious virus.
- Each value must have its own cell. If you consider the dataset to be a matrix, you would then match the variable column with the observation row to determine what goes in a cell. For example, in a dataset about music artists, information about when their first number one song was released would be in the cell that goes with that artist’s row and the column for Date of First Number One Song.
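The three rules can be illustrated with a short Python sketch that reshapes a spreadsheet-style table into tidy form. The data are invented for illustration, and the function name `wide_to_tidy` and the column name `number_one_songs` are hypothetical. In the starting table, the column headers "2019" and "2020" are values of a year variable rather than variable names, which is one common way a dataset breaks the rules.

```python
def wide_to_tidy(wide_rows, id_column, variable_name, value_name):
    """Turn one row per id with many value columns into one row per observation."""
    tidy = []
    for row in wide_rows:
        for column, value in row.items():
            if column == id_column:
                continue
            # Each former column header becomes a value in its own cell.
            tidy.append(
                {id_column: row[id_column], variable_name: column, value_name: value}
            )
    return tidy


# Invented example: number-one songs per artist, with one column per year.
wide = [
    {"artist": "Artist A", "2019": 1, "2020": 3},
    {"artist": "Artist B", "2019": 0, "2020": 2},
]

tidy = wide_to_tidy(wide, "artist", "year", "number_one_songs")
for row in tidy:
    print(row)
# {'artist': 'Artist A', 'year': '2019', 'number_one_songs': 1}
# {'artist': 'Artist A', 'year': '2020', 'number_one_songs': 3}
# {'artist': 'Artist B', 'year': '2019', 'number_one_songs': 0}
# {'artist': 'Artist B', 'year': '2020', 'number_one_songs': 2}
```

After reshaping, each variable (artist, year, number of number-one songs) has its own column, each artist-year observation has its own row, and each value sits in its own cell.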
View the video below that explains tidy data, the five most common issues that break tidy data rules, and how to correct them.
By Peace Ossom-Williamson
(See bibliography for sources)
Data cleansing or data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database and refers to identifying incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data. Data cleansing may be performed interactively with data wrangling tools, or as batch processing through scripting.
Source: https://en.wikipedia.org/wiki/Data_cleansing
The FLOAT Method is a five-step process that facilitates your ability to outline your project by (1) conceiving of a research question, (2) explaining how you locate, (3) organize, and (4) analyze a given data source, and finally, (5) transforming your findings into a visualization.
Tidy data is an alternative name for the common statistical form called a model matrix or data matrix.
Source: https://en.wikipedia.org/wiki/Tidy_data