How to prepare your spreadsheets so that they can be analysed efficiently is something you tend to learn on the job. In this blog post, I give some tips for organising datasets that I hope will be of use to students and researchers starting out in quantitative research.

Use self-explanatory names and labels, but keep them short

Try to minimise the need to consult the project codebook during the analysis. To this end, name your variables and their values sensibly. If you’re analysing questionnaire data, for instance, a variable name like Q3c means little to the analyst and requires them to refer back to the questionnaire or the codebook. Renaming this variable DialectUse makes it immediately clear that the variable likely encodes the degree to which the informant claims to use a dialect.

Similarly, a series of 0s and 1s in the Sex variable is uninformative. Sure, one of them will mean ‘man’ and the other ‘woman’, but which is which? Using the labels man and woman instead avoids any misunderstanding. Note, too, that value labels don’t have to be numeric.

Lastly, use a consistent and unambiguous label to identify missing data. NA is usually a good choice; avoid codes such as 0 or -99. Especially if you’re analysing numeric data, it may not be immediately obvious to the analyst that 0 or -99 doesn’t refer to an actual measurement. (See exercises 2 and 3 on page 77 in my introduction to statistics, if you can read German.)
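By way of illustration (with made-up reaction times): R treats NA as missing automatically, whereas a code like -99 is silently treated as a real measurement.

```r
# Hypothetical reaction times in milliseconds; one measurement is missing.
rt_bad  <- c(530, 612, 498, -99)  # missing value coded as -99
rt_good <- c(530, 612, 498, NA)   # missing value coded as NA

mean(rt_bad)                 # -99 enters the calculation: 385.25
mean(rt_good)                # returns NA, flagging the missing value
mean(rt_good, na.rm = TRUE)  # 546.67: the mean of the three real measurements
```

The NA version forces you to deal with the missing value explicitly; the -99 version quietly skews every summary statistic.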

Transparent variable names and data labels facilitate the analysis because you don’t have to go back and forth between the dataset and the codebook to make sense of the data. A further benefit is that the code used for the analysis tends to be more readable, even months after the analysis has been conducted.

That said, when coming up with descriptive names and labels, try to keep them short. A variable name like HowOftenDoYouSpeakTheLocalDialectWithYourChild is self-explanatory but cumbersome to type repeatedly when analysing the data. DialectChild might be less self-explanatory but is easier to use.
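If the data have already been entered under opaque names, renaming them afterwards takes one line; a sketch in R with a made-up variable:

```r
# Hypothetical questionnaire data with an opaque variable name
dat <- data.frame(Q3c = c(1, 3, 5))

# Rename Q3c to the more transparent DialectUse
names(dat)[names(dat) == "Q3c"] <- "DialectUse"
names(dat)  # "DialectUse"
```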

In addition, try to keep the following guidelines in mind:

  • Capitalisation matters. In R, an informant whose Sex is man is of a different gender than one whose Sex is Man. Any such confusions become clear soon enough when analysing the data and they're pretty easy to take care of. But avoiding them is better for the analyst's grumpiness level :)
  • Special characters are tricky. German Umlaut characters and French accents may show up without problems on your computer, but may require the manual specification of a character encoding on a different system. Try to either avoid special characters or know which character encoding you saved the dataset in (see the MS Office support page for Excel; LibreOffice users can just use the 'Save as' function and will be asked which encoding they prefer).
  • Don't use spaces, question marks or exclamation marks in variable names.
  • Colour-coding is fine for your own use, but the colours get lost when the dataset is imported into a statistics program.
  • 'Empty' rows or columns may contain lingering spaces that mess up the dataset when it's imported into a statistics program. By way of example, the csv file FormantMeasurements1.csv (available here) looks fine when it's opened in a spreadsheet program:
    [Screenshot: the dataset as it appears in a spreadsheet program]
    When the dataset is imported into R, however, there seem to be two additional variables (X and X.1), some missing values (NA) and a Speaker without an ID:
    dat <- read.csv("http://janhove.github.io/downloads/FormantMeasurements1.csv")
    summary(dat)
    ##  Speaker     Trial            Word          F1               F2
    ##    : 2    Min.   : 1.00   baat   : 8   Min.   : 197.0   Min.   : 581.9
    ##  S1:24    1st Qu.: 6.75   biet   : 8   1st Qu.: 278.5   1st Qu.: 899.7
    ##  S2:24    Median :12.50   buut   : 8   Median : 350.2   Median :1544.8
    ##  S3:24    Mean   :12.50   daat   : 8   Mean   : 488.0   Mean   :1557.3
    ##  S4:24    3rd Qu.:18.25   diet   : 8   3rd Qu.: 772.2   3rd Qu.:2309.4
    ##           Max.   :24.00   duut   : 8   Max.   :1031.1   Max.   :2773.7
    ##           NA's   :2       (Other):50   NA's   :2        NA's   :2
    ##     X              X.1
    ##  Mode:logical   Mode:logical
    ##  NA's:98        NA's:98
    The reason is that there are superfluous spaces in cells D99 and G9.
  • Similarly, a value label with a trailing space will be inconspicuous in your spreadsheet program but will affect how the dataset is read into your statistics program. The csv file FormantMeasurements2.csv contains the same data as FormantMeasurements1.csv, but when it is read into R, there seem to be two Speakers with the same ID (S2):
    dat <- read.csv("http://janhove.github.io/downloads/FormantMeasurements2.csv")
    summary(dat)
    ##  Speaker      Trial            Word          F1               F2
    ##  S1 :24   Min.   : 1.00   baat   : 8   Min.   : 197.0   Min.   : 581.9
    ##  S2 :23   1st Qu.: 6.75   biet   : 8   1st Qu.: 278.5   1st Qu.: 899.7
    ##  S2 : 1   Median :12.50   buut   : 8   Median : 350.2   Median :1544.8
    ##  S3 :24   Mean   :12.50   daat   : 8   Mean   : 488.0   Mean   :1557.3
    ##  S4 :24   3rd Qu.:18.25   diet   : 8   3rd Qu.: 772.2   3rd Qu.:2309.4
    ##           Max.   :24.00   duut   : 8   Max.   :1031.1   Max.   :2773.7
    ##                           (Other):48
    The reason is that there's a trailing space in cell A28. Less obvious (because it doesn't show up in the summary() output) is that there's also a superfluous space in the Word column (the carrier word tiet occurs twice, once with and once without trailing space):
    levels(dat$Word)
    ##  [1] "baat"  "biet"  "buut"  "daat"  "diet"  "duut"  "paat"  "piet" 
    ##  [9] "puut"  "taat"  "tiet"  "tiet " "tuut" 
    As you can check for yourself, the trailing space is in cell C2. Problems like inconsistent capitalisation or trailing spaces are pretty easy to fix when you know they're there, but they aren't always obvious.
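Such leftover spaces and capitalisation slips can also be patched up after importing. read.csv()'s strip.white = TRUE argument already removes leading and trailing spaces from unquoted fields; alternatively, clean the offending column yourself. A sketch with made-up values:

```r
# Made-up vector with a trailing space and inconsistent capitalisation
word <- c("tiet", "tiet ", "Tiet")

word <- gsub("^\\s+|\\s+$", "", word)  # strip leading and trailing whitespace
word <- tolower(word)                  # normalise capitalisation
unique(word)  # "tiet"
```

Fixes like these are best done in the analysis script rather than in the spreadsheet, so that there is a record of what was changed.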

A ‘long’ data format is usually easier to manage than a ‘wide’ one

An example of a ‘wide’ dataset is Cognates_wide.csv (available here). The column Subject contains the participants’ IDs and every other column expresses whether the participant translated a given stimulus (e.g. ‘ägg’, ‘alltid’, ‘äta’ etc.) correctly. Every participant has their own row:

dat <- read.csv("http://janhove.github.io/downloads/Cognates_wide.csv",
                fileEncoding = "UTF-8")
head(dat)
##   Subject agg alltid alska ata avskaffa bäbis bakgrund barn behärska bliva
## 1      64   0      0     1   1        0     0        1    0        0     0
## 2      78   0      0     0   0        0     0        1    0        0     0
## 3     134   0      0     0   0        0     0        1    0        0     0
## 4     230   0      0     0   0        0     0        1    0        0     0
## 5     288   0      0     0   0        0     0        1    0        0     0
## 6     326   0      0     0   0        0     1        1    1        0     0
##   blomma borgmästare borja branna butelj byrå choklad cyckel egenskap elev
## 1      1           0     0      1      0    0       0      0        0    0
## 2      1           0     0      1      0    0       0      1        0    0
## 3      1           0     0      1      0    0       0      0        0    0
## 4      0           1     0      1      0    0       0      0        0    0
## 5      0           0     0      1      0    0       0      0        0    0
## 6      1           0     0      1      0    0       0      1        0    1
##   ensam fåtölj fiende flicka fonster försiktig forsoka forst forsvinna
## 1     0      0      1      0       1         1       0     0         1
## 2     0      0      0      0       1         1       0     0         1
## 3     1      0      0      0       1         0       0     0         1
## 4     0      0      1      0       1         1       0     0         1
## 5     0      0      0      0       1         1       0     0         0
## 6     0      0      1      0       1         1       0     0         1
##   förutsättning fotboll fraga frasch full ga grupp hård häst hemlig
## 1             1       1     1      1    1  1     0    1    0      0
## 2             0       1     1      1    1  1     0    1    0      0
## 3             0       1     0      1    0  0     1    0    0      1
## 4             0       1     1      0    1  0     0    1    0      0
## 5             0       1     0      0    0  0     0    1    0      0
## 6             0       1     0      1    1  0     0    1    0      0
##   ingenjor intryck is kagel kanel karnkraftverk kejsar kniv konst
## 1        1       0  0     0     0             0      0    1     1
## 2        1       0  0     0     0             1      0    1     1
## 3        1       0  0     0     0             1      0    1     1
## 4        1       0  0     0     0             0      1    1     1
## 5        1       0  0     0     0             1      0    0     1
## 6        1       0  0     0     0             1      1    0     1
##   korruption kung kyrka kyssa lang larm leka löpa markvardig mjölk möjlig
## 1          0    0     1     1    1    1    0    0          1     0      0
## 2          0    0     1     1    1    0    0    0          0     1      0
## 3          0    0     0     0    0    1    0    0          1     1      0
## 4          0    0     1     1    1    0    0    0          0     1      1
## 5          0    0     0     0    0    1    0    0          1     0      0
## 6          0    0     1     1    1    0    0    0          1     1      0
##   mycket nackdel öppna ost overraska översätta paraply passiv potatis
## 1      0       0     1   1         0         0       1      1       1
## 2      0       0     1   1         0         0       1      1       1
## 3      0       1     1   1         0         1       1      1       1
## 4      0       0     1   1         0         0       1      0       0
## 5      0       0     0   0         0         0       0      1       0
## 6      0       0     0   1         0         0       0      1       1
##   rådhus rytmisk saliv sitta sjalvstandig skarm skola skön skriva skrubba
## 1      1       1     1     1            1     0     1    1      0       1
## 2      1       1     0     1            0     0     1    0      0       0
## 3      1       1     0     1            1     0     0    0      0       0
## 4      1       1     0     1            0     0     1    0      0       0
## 5      0       0     0     0            1     0     0    0      0       0
## 6      1       1     0     1            0     0     1    1      0       0
##   skyskrapa smart smink söka spegel språk städa stjärn sverige tanka tårta
## 1         1     1     0    0      1     1     0      0       0     1     1
## 2         0     0     1    0      1     1     0      1       0     0     1
## 3         1     0     1    0      0     0     0      1       0     1     0
## 4         1     1     0    1      1     1     0      1       0     0     1
## 5         1     1     0    0      1     0     0      1       0     0     0
## 6         0     1     1    0      1     1     0      1       0     1     1
##   torsdag trakig tunga tvivla tydlig ursprung värld varm vaxla viktig
## 1       1      0     0      1      0        1     0    1     1      1
## 2       0      0     0      0      0        1     0    1     1      1
## 3       1      0     0      0      0        1     0    1     1      0
## 4       1      0     0      0      0        1     0    1     0      1
## 5       1      0     0      0      0        0     0    1     1      1
## 6       1      0     1      0      0        1     0    1     1      1
##   ytterst
## 1       0
## 2       0
## 3       0
## 4       0
## 5       0
## 6       0

In a ‘long’ dataset (e.g. Cognates_long.csv), the data are arranged differently, typically with one row per ‘observation unit’:

dat <- read.csv("http://janhove.github.io/downloads/Cognates_long.csv",
                fileEncoding = "UTF-8")
head(dat); tail(dat)
##   Subject Stimulus Correct
## 1    1034      agg       0
## 2    2151      agg       0
## 3    9022      agg       0
## 4    7337      agg       0
## 5    8477      agg       0
## 6    6544      agg       0
##       Subject Stimulus Correct
## 16295    5290  ytterst       0
## 16296    3125  ytterst       0
## 16297    4137  ytterst       0
## 16298    1967  ytterst       0
## 16299    8942  ytterst       0
## 16300    9913  ytterst       0

I prefer long datasets for the following reasons:

  • The statistical tools I usually work with (mixed-effects models) require data in a long format.

  • It’s easier to add information to a long than to a wide dataset. For instance, if I wanted to add the order in which the participants saw the stimuli to the wide format, I’d have to add 100 columns to save the trial numbers (i.e. aggTrial, alltidTrial etc.). I’d only have to add one column (Trial) in the case of long data.

  • Wide datasets are useful for analytical techniques such as principal component analysis, factor analysis and structural equation modelling, as well as for paired t-tests. Even so, I find it easier to convert a long dataset to a wide one than vice versa, especially if the dataset contains additional information about the stimuli or trials.

A useful package for converting long data to wide data and vice versa is reshape2, whose use is helpfully explained by Sean Anderson.
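To give a flavour of what such a conversion looks like, here's a sketch with a toy dataset (two participants, two stimuli), assuming the reshape2 package is installed: melt() goes from wide to long, dcast() back again.

```r
library(reshape2)

# Toy wide dataset: one row per participant, one column per stimulus
wide <- data.frame(Subject = c(64, 78),
                   agg     = c(0, 0),
                   alltid  = c(1, 0))

# Wide to long: one row per Subject/Stimulus combination
long <- melt(wide, id.vars = "Subject",
             variable.name = "Stimulus", value.name = "Correct")

# And back from long to wide
wide2 <- dcast(long, Subject ~ Stimulus, value.var = "Correct")
```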

Managing several smallish datasets and then joining them before the analysis is easier than handling one large dataset

Complex datasets often contain repeated data. For instance, we could add the participants’ age to Cognates_long.csv above, but we’d have to repeat each participant’s age 100 times. Similarly, we could add information about the stimuli to Cognates_long.csv (e.g. their length in phonemes), but we’d have to repeat each stimulus’s length 163 times (once for each participant). In such cases, I find it easiest to compile several smaller datasets with as little information repeated as possible and to combine them when needed. That way, if I make a mistake when entering the data, I’ll only have to correct it once and then recombine the datasets.

Here’s an illustration of what I mean. The participants’ accuracy per stimulus is stored in Cognates_long.csv.

dat <- read.csv("http://janhove.github.io/downloads/Cognates_long.csv",
                fileEncoding = "UTF-8")

Background information about the participants is available in ParticipantInformation.csv:

participants <- read.csv("http://janhove.github.io/downloads/ParticipantInformation.csv",
                         fileEncoding = "UTF-8")
summary(participants)
##     Subject         Sex          Age            NrLang     
##  Min.   :  64   female:90   Min.   :10.00   Min.   :1.000
##  1st Qu.:2990   male  :73   1st Qu.:16.00   1st Qu.:2.000
##  Median :5731               Median :39.00   Median :3.000
##  Mean   :5317               Mean   :40.28   Mean   :3.067
##  3rd Qu.:7769               3rd Qu.:59.50   3rd Qu.:4.000
##  Max.   :9913               Max.   :86.00   Max.   :9.000
##     DS.Span         DS.Total          GmVoc           Raven
##  Min.   :2.000   Min.   : 2.000   Min.   : 4.00   Min.   : 0.0
##  1st Qu.:4.000   1st Qu.: 5.000   1st Qu.:29.00   1st Qu.:12.0
##  Median :4.000   Median : 6.000   Median :34.00   Median :19.0
##  Mean   :4.613   Mean   : 6.374   Mean   :30.24   Mean   :17.8
##  3rd Qu.:5.000   3rd Qu.: 7.500   3rd Qu.:36.00   3rd Qu.:24.0
##  Max.   :8.000   Max.   :12.000   Max.   :41.00   Max.   :35.0
##                                   NA's   :1
##   EnglishScore
##  Min.   : 3.00
##  1st Qu.:20.75
##  Median :31.00
##  Mean   :28.30
##  3rd Qu.:37.00
##  Max.   :44.00
##  NA's   :3

The entries in the Subject column in Cognates_long.csv correspond to those in the same column in ParticipantInformation.csv. Using merge(), we can add the information in the latter dataset to the corresponding rows in the former:

dat <- merge(dat, participants, "Subject")
head(dat); tail(dat)
##   Subject      Stimulus Correct    Sex Age NrLang DS.Span DS.Total GmVoc
## 1      64         forst       0 female  27      4       6        7    34
## 2      64          löpa       0 female  27      4       6        7    34
## 3      64           ost       1 female  27      4       6        7    34
## 4      64         tunga       0 female  27      4       6        7    34
## 5      64 förutsättning       1 female  27      4       6        7    34
## 6      64         ensam       0 female  27      4       6        7    34
##   Raven EnglishScore
## 1    28           42
## 2    28           42
## 3    28           42
## 4    28           42
## 5    28           42
## 6    28           42
##       Subject  Stimulus Correct    Sex Age NrLang DS.Span DS.Total GmVoc
## 16295    9913     grupp       1 female  40      3       5        8    29
## 16296    9913      kung       0 female  40      3       5        8    29
## 16297    9913     städa       0 female  40      3       5        8    29
## 16298    9913     tanka       1 female  40      3       5        8    29
## 16299    9913 skyskrapa       0 female  40      3       5        8    29
## 16300    9913   rytmisk       0 female  40      3       5        8    29
##       Raven EnglishScore
## 16295    17           17
## 16296    17           17
## 16297    17           17
## 16298    17           17
## 16299    17           17
## 16300    17           17

Similarly, we can add information about the stimuli, available in StimulusInformation.csv, to dat.

stimuli <- read.csv("http://janhove.github.io/downloads/StimulusInformation.csv",
                    fileEncoding = "UTF-8")
summary(stimuli)
##      Stimulus       Status        Duration          Swedish          German
##  agg     : 1   profile:10   Min.   : 516.0   2st     : 1   abschaffen: 1
##  alltid  : 1   target :90   1st Qu.: 708.8   2v@raska: 1   ai        : 1
##  alska   : 1                Median : 869.5   alltid  : 1   aig@nSaft : 1
##  ata     : 1                Mean   : 843.4   avskaffa: 1   aindruk   : 1
##  avskaffa: 1                3rd Qu.: 929.8   b2rja   : 1   ainzam    : 1
##  bäbis   : 1                Max.   :1264.0   babis   : 1   (Other)   :75
##  (Other) :94                NA's   :50       (Other) :94   NA's      :20
##      English         French         LevGer           LevEng
##  ais       : 1   alarm   : 1   Min.   :0.1429   Min.   :0.0000
##  b2rn      : 1   bebe    : 1   1st Qu.:0.4000   1st Qu.:0.4000
##  baby      : 1   bureau  : 1   Median :0.5000   Median :1.0000
##  background: 1   butEj   : 1   Mean   :0.5859   Mean   :0.6997
##  blum      : 1   cannelle: 1   3rd Qu.:0.8333   3rd Qu.:1.0000
##  (Other)   :42   (Other) :18   Max.   :1.0000   Max.   :1.0000
##  NA's      :53   NA's    :77
##      LevFre         FreqGerman        FreqEnglish      FreqFrench
##  Min.   :0.0000   Min.   :   0.000   Min.   :   0.0   Min.   :    0
##  1st Qu.:1.0000   1st Qu.:   0.932   1st Qu.:   0.0   1st Qu.:    0
##  Median :1.0000   Median :  20.020   Median :   0.0   Median :    0
##  Mean   :0.8529   Mean   : 109.714   Mean   : 119.4   Mean   :  237
##  3rd Qu.:1.0000   3rd Qu.:  60.138   3rd Qu.:  43.6   3rd Qu.:    0
##  Max.   :1.0000   Max.   :3460.090   Max.   :3793.0   Max.   :22823

The column the two datasets have in common is Stimulus:

dat <- merge(dat, stimuli, "Stimulus")
head(dat); tail(dat)
##   Stimulus Subject Correct    Sex Age NrLang DS.Span DS.Total GmVoc Raven
## 1      agg    7479       0   male  25      4       8       11    38    34
## 2      agg    7337       0 female  49      3       4        5    36    19
## 3      agg    2846       0 female  11      3       4        6     4    15
## 4      agg    5153       0 female  51      3       4        6    37    18
## 5      agg    6510       0 female  41      2       5        8    34    30
## 6      agg    6329       0 female  56      3       5        6    37    21
##   EnglishScore Status Duration Swedish German English French LevGer LevEng
## 1           43 target      547      Eg     ai      Eg   <NA>      1      0
## 2           30 target      547      Eg     ai      Eg   <NA>      1      0
## 3            6 target      547      Eg     ai      Eg   <NA>      1      0
## 4           37 target      547      Eg     ai      Eg   <NA>      1      0
## 5           40 target      547      Eg     ai      Eg   <NA>      1      0
## 6           42 target      547      Eg     ai      Eg   <NA>      1      0
##   LevFre FreqGerman FreqEnglish FreqFrench
## 1      1      24.06       26.04          0
## 2      1      24.06       26.04          0
## 3      1      24.06       26.04          0
## 4      1      24.06       26.04          0
## 5      1      24.06       26.04          0
## 6      1      24.06       26.04          0
##       Stimulus Subject Correct    Sex Age NrLang DS.Span DS.Total GmVoc
## 16295  ytterst    1350       0   male  43      4       7       10    38
## 16296  ytterst    7337       0 female  49      3       4        5    36
## 16297  ytterst    8578       0   male  11      2       4        6     9
## 16298  ytterst     717       0 female  15      2       5        6    29
## 16299  ytterst    8805       0   male  70      4       4        4    35
## 16300  ytterst    9912       0   male  10      2       4        4     9
##       Raven EnglishScore Status Duration Swedish   German English French
## 16295    31           38 target       NA ytterst ausserst    <NA>   <NA>
## 16296    19           30 target       NA ytterst ausserst    <NA>   <NA>
## 16297    20           17 target       NA ytterst ausserst    <NA>   <NA>
## 16298    21           25 target       NA ytterst ausserst    <NA>   <NA>
## 16299    22           35 target       NA ytterst ausserst    <NA>   <NA>
## 16300    14            9 target       NA ytterst ausserst    <NA>   <NA>
##       LevGer LevEng LevFre FreqGerman FreqEnglish FreqFrench
## 16295    0.5      1      1      23.19           0          0
## 16296    0.5      1      1      23.19           0          0
## 16297    0.5      1      1      23.19           0          0
## 16298    0.5      1      1      23.19           0          0
## 16299    0.5      1      1      23.19           0          0
## 16300    0.5      1      1      23.19           0          0
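One caveat when combining datasets like this: by default, merge() silently drops rows whose ID doesn't occur in both datasets. If you want to keep such rows, set all.x = TRUE (or all = TRUE). A sketch with made-up IDs:

```r
# Made-up example: participant 99 has trial data but no background info
trials      <- data.frame(Subject = c(1, 2, 99), Correct = c(1, 0, 1))
backgrounds <- data.frame(Subject = c(1, 2),     Age     = c(27, 40))

nrow(merge(trials, backgrounds, "Subject"))               # 2: participant 99 is dropped
nrow(merge(trials, backgrounds, "Subject", all.x = TRUE)) # 3: Age becomes NA for 99
```

Checking the number of rows before and after a merge() is a cheap way to catch ID mismatches (for instance, the trailing-space problems discussed above).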

Reading tip

Hadley Wickham’s Tidy data, while primarily geared towards software developers, is well worth a read.


Published

18 June 2015
