Database of Emotional Videos from Ottawa (DEVO)

We present a collection of emotional video clips that can be used in ways similar to static images (e.g., the International Affective Picture System, IAPS; Lang, Bradley, & Cuthbert, 2008). The Database of Emotional Videos from Ottawa (DEVO) includes 291 brief video clips (mean duration = 5.42 s; SD = 2.89 s; range = 3–15 s) extracted from obscure sources to reduce their familiarity and to avoid influencing participants’ emotional responses. In Study 1, ratings of valence and arousal (measured with the Self-Assessment Manikins from IAPS) and impact (Croucher, Calder, Ramponi, Barnard, & Murphy, 2011) were collected from 154 participants (82 women; mean age = 19.88 years; SD = 2.83 years), in a between-subjects design to avoid potential halo effects across the three ratings (Saal, Downey, & Lahey, 1980). Ratings collected online in a new set of 124 students with a within-subjects design (Study 2) were significantly correlated with the original sample’s. The clips were unfamiliar, having been seen previously by fewer than 2% of participants on average. The ratings consistently revealed the expected U-shaped relationships between valence and arousal/impact, and a strong positive correlation between arousal and impact. Hierarchical cluster analysis of the Study 1 ratings suggested seven groups of clips varying in valence, arousal, and impact, whereas the Study 2 ratings suggested five groups of clips. These clips should prove useful for a wide range of research on emotion and behaviour.


Introduction
For decades, psychology and neuroscience have benefitted greatly from using standardized sets of static emotional images (e.g., the International Affective Picture System; IAPS; Lang et al., 2008) to learn about the nature of emotion and its influences on perception, cognition, and behaviour. Yet, the visual world is dynamic; real-life objects and scenes often involve motion. Thus, moving images ('movies' or 'video clips') can arguably provide greater ecological validity than static images, approaching closer to the real-world demands placed on visual perception and cognition. The current paper provides researchers with a new collection of emotional video clips, which can be used in ways similar to static image collections.
In addition to ecological validity, moving images present other potential advantages over static images. Motion is a powerful cue to object identity (Chen, Han, Hua, Gong, & Huang, 2003; Johansson, 1973; Regan, 1986; Ullman, 1979). Moving images have been argued to be 'behaviourally urgent' and to capture attention (for reviews, see Rauschenberger, 2003; Theeuwes, 2010) and boost physiological arousal more easily than static images (Detenber & Simons, 1998). Perhaps for these reasons, moving images can be more emotionally powerful than static images (Courtney, Dawson, Schell, Iyer, & Parsons, 2010). In addition, moving images are easier to remember than static images (Candan, Cutting, & DeLong, 2016; Ferguson, 2014; Ferguson, Homa, & Ellis, 2016; Matthews, Benjamin, & Osborne, 2007; Matthews, Buratto, & Lamberts, 2010) and textual stories (Baggett, 1979; Candan et al., 2016). This means that video clips may be particularly useful in memory studies in which researchers want to keep participants' performance off the floor, for example, when measuring memory using a rigorous test such as free recall, probing for details, or waiting a long time (i.e., days, weeks, or months) between study and test.
In light of these considerations, we present the Database of Emotional Videos from Ottawa (DEVO). It includes 291 video clips (3 to 15 s in duration) that are novel to most viewers, to minimize the confounding effects of familiarity. The clips portray a variety of scenes (e.g., human interactions, animals, nature, food/drink) obtained from motion pictures and amateur videos online. We provide ratings of valence, arousal, and impact (collected between subjects to avoid halo effects; Saal et al., 1980) from 82 women and 72 men in Study 1, and from 124 participants (collected within-subjects, on the advice of a reviewer) in Study 2. For the ratings of valence and arousal we used the method from the IAPS (Lang et al., 2008). The rating of impact came from Croucher et al. (2011), who have argued that it plays a more important role than the traditional concept of arousal does in emotion's effects on attention and memory. Because we aim to use these video clips in studies of attention and memory, we examined the degree to which each clip's rating on impact was correlated with its rating on arousal.
Another useful piece of information to have when using an emotional stimulus set regards the extent to which stimuli can be clustered or grouped together, especially along the three dimensions of emotion that we collected (i.e., valence, arousal, and impact). For this reason, we categorized video clips based on the three ratings using k-means and hierarchical cluster analyses. This will allow researchers to consider all three dimensions when selecting videos from the database.

Video Clip Selection Sources
One hundred and fifty-four clips were selected from Canadian or foreign movies and documentaries, generally excluding Hollywood and internationally known films. The remaining 137 clips were selected from online sources (YouTube, Vimeo, or other), depicting real-life events (e.g., cliff jumping or a natural disaster) or scripted scenes (e.g., a person presenting a cake or chopping onions). For 78 sources, at least two clips were extracted, generally one neutral and one emotional, to provide a tighter comparison between emotional and neutral stimuli (because certain video sources may contain unique attributes due to filming conditions and/or post-production effects). In these cases, effort was taken to minimize potential overlap in background context and characters. The clips were typically nominated by one team member and agreed upon by at least one additional member. Effort was made to find clips that would populate all quadrants of the valence-arousal/impact space, focusing particularly on finding positive valence-high arousal/impact clips. Each clip was assigned a unique number whereby the first value referred to a given source and the decimal referred to the clip number from that source (for sources where more than one clip was extracted). Full information about the sources [title, format (DVD or online), date it was released or uploaded online] is provided in the DEVO spreadsheet. Further information about the hue, saturation, and value of the clips is available in the appended files: Text S1.xlsx and Text S2.pdf.
For access to the stimuli, email moviesstudyuo@gmail.com or patrick.davidson@uottawa.ca. All data, including participant-level data, are available in the DEVO spreadsheet.

Themes
The clips depict a variety of themes including emotional and neutral human interactions, animals, nature, and food/drink. In addition, each clip was scored by two researchers (see the DEVO spreadsheet) for: people presence (1 = yes; 0 = no), animal presence (1 = yes; 0 = no), and presence of food and/or drink, including drugs (1 = yes; 0 = no).

Specifications
Videos were copied from an original DVD source (when available) or downloaded from an online source in the highest available resolution using DVDFab9 software. An initial set of 90 clips was trimmed to 3 seconds in duration using DVDFab9 software and saved in .wmv format (codec: Windows Media video and audio professional). The remaining 201 clips varied in duration (mean = 6.5 s; SD = 2.9 s; range = 3 s to 15.14 s), and were trimmed and saved in .avi format using Filmora software (codec: H.264). Care was taken to minimize cinematographic effects, such as zooming, modified frame speed, or changing viewpoints, as these may momentarily disrupt visual processing (Hirose, 2010; Shimamura, Cohn-Sheehy, Pogue, & Shimamura, 2015; Shimamura, Cohn-Sheehy, & Shimamura, 2014; Smith & Henderson, 2008). Emotional clips were selected to include the peak emotional aspect of the scene (when taken from a longer segment), as identified by two researchers. All sound was removed using Filmora software.

Video Clip Validation
We collected ratings of each video clip in the laboratory.

Participants
Data from six participants were excluded from the analysis (four had an E-Prime runtime error, one self-excluded, and one gave the same response to all items). This resulted in a final sample of 154 adults (82 women, 72 men; mean age = 19.88 ± 2.83 SD; mean years of education = 13.27 ± 1.67 SD) who rated all 291 video clips. Participants were randomly assigned to either the valence, arousal, or impact rating condition (see Table 1). Participants were recruited from the University of Ottawa undergraduate research pool or from the Ottawa community using newspaper and social media ads. University students were given course credit for their participation and community participants were paid $10. The study was approved by the University of Ottawa research ethics board (#H08-14-25). All participants provided informed consent at the outset.

Procedures
Each trial began with a white screen displayed for 5 s, which allowed the upcoming video clip to buffer. The videos were presented one at a time in a pseudo-random order (that is, to prevent E-Prime from crashing, we randomly assigned clips to one of three blocks, and randomized the order of presentation of clips within each block) at their original frame speed. At the offset of each video, participants were given as much time as needed to rate their subjective feeling of valence, arousal, or impact, depending on their condition assignment. Valence and arousal were measured using the Self-Assessment Manikins from IAPS (Lang et al., 2008). Valence was rated from 1 (happy) to 9 (unhappy). Arousal was rated from 1 (excited) to 9 (calm), unlike the IAPS, where the scale runs from 9 (excited) to 1 (calm); this was done to maintain coherence between the spatial organization of the keyboard (1-9) and the pictorial representation of arousal, which goes from excited (left) to calm (right). Impact was rated from 1 (no impact) to 9 (intensive impact), as per Croucher et al. (2011). For exact instructions, see Appendix C. Participants were given a printed copy of the instructions from the original research papers to refer to during the course of the experiment. The between-subjects design was used to avoid potential halo effects across the three ratings that may occur when participants provide multiple ratings consecutively (for a review, see Saal et al., 1980). As such, participants viewed all videos and rated each one on the same dimension. Following each self-report rating, participants were given 2 s to respond "yes" to the following question: "Have you seen this video clip before?" The next trial would then begin with the white screen.
Participants viewed the videos in three blocks of 97 clips each, with a break between blocks. They were tested either individually or in a room with up to three other participants, and were always monitored by the experimenter to ensure there were no adverse effects from viewing the emotional scenes. All participants were given three practice trials at the start of the experiment, with one negative, one positive, and one neutral clip, to familiarize themselves with the rating protocol as well as the emotional nature of the clips.

Statistical Analyses
First, we provide the ratings and familiarity scores for each video clip. Then we performed k-means and hierarchical cluster analyses on the mean ratings of valence, arousal, and impact for each video clip (averaged across all participants in the given condition), using SPSS Statistics 24 software. Two different clustering techniques were used to examine the underlying organization of the video clips in the database. These clustering techniques seek to identify groups (or 'clusters') of relatively homogeneous video clips while maximizing heterogeneity between clusters. The advantage of these multivariate techniques is that they calculate the similarity in ratings between clips while considering all three dimensions simultaneously.
The first technique, k-means clustering, allows the user to specify the desired number of clusters. For the current database, a solution of three clusters was chosen because researchers commonly select stimuli based on their assigned valence category (positive, negative, neutral). To compute a three-cluster solution, k-means first establishes k random cluster means (in this case k = 3), and then assigns each video clip to the nearest cluster mean (Morissette & Chartier, 2013). The mean for each cluster of videos is then calculated, and clips are reassigned to the cluster mean that is closest in value. Iterations continue until the classification of the video clips remains stable (i.e., the cluster memberships no longer change when the cluster means are recalculated).
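The assign-then-recompute loop described above can be sketched in a few lines of Python (a minimal illustration with NumPy; the six (valence, arousal, impact) rating triples below are hypothetical values, not actual DEVO data, and the initial means are fixed for reproducibility rather than chosen at random as in SPSS):

```python
import numpy as np

def kmeans(X, k, init, max_iter=100):
    """Minimal k-means (Lloyd's algorithm): assign each clip to the
    nearest cluster mean, recompute the means, repeat until stable."""
    means = X[init].astype(float)          # initial cluster means
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # distance from every clip to every cluster mean
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)  # nearest-mean assignment
        if np.array_equal(new_labels, labels):
            break                          # memberships stable: done
        labels = new_labels
        for j in range(k):                 # recompute each cluster mean
            if np.any(labels == j):
                means[j] = X[labels == j].mean(axis=0)
    return labels, means

# Hypothetical (valence, arousal, impact) means for six clips:
ratings = np.array([[7.8, 2.5, 6.0], [7.5, 2.8, 5.5],   # 'negative'
                    [2.0, 3.2, 5.0], [2.3, 3.5, 4.8],   # 'positive'
                    [5.0, 6.5, 2.0], [5.2, 6.8, 2.2]])  # 'neutral'
labels, means = kmeans(ratings, k=3, init=[0, 2, 4])
```

With these well-separated toy points the loop stabilizes after one reassignment pass, grouping each pair of similar clips together.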
The second technique, hierarchical clustering, identifies clusters of homogeneous cases when the total number of clusters is unknown. Agglomerative hierarchical clustering begins by calculating the distance (here, the squared Euclidean distance) between all video clips, after which the two clips with the smallest distance are joined together to form a cluster. The process continues whereby at each step either a new cluster is formed by joining two clips together or a clip joins a previously merged cluster (the distance between clusters was calculated using the average linkage between-groups method; for further details on the procedures, see Yim & Ramdeen, 2015). The clustering process continues until all clips form one large cluster; it must therefore be stopped before the clusters become too heterogeneous. The cut-off point can be determined by looking at the outputted agglomeration schedule, which lists the cases and clusters that are merged at each stage of the process as well as their relative heterogeneity (as indicated by a coefficient value; Yim & Ramdeen, 2015). The clustering was therefore cut off before the first large increase in coefficients, to ensure that the groups of clips remained relatively homogeneous. From this, the total number of clusters was obtained and the different ratings for each cluster were compared. Summary information, including a description of the clips and their hierarchical cluster assignment, is provided in Table 2.
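The agglomerative procedure (average linkage over squared Euclidean distances, with the cut-off taken before the first large jump in the agglomeration coefficient) can be sketched as follows; this is a small self-contained illustration, and the six rating triples are hypothetical, not actual DEVO values:

```python
import numpy as np
from itertools import product

def agglomerate(X):
    """Average-linkage (between-groups) agglomerative clustering on
    squared Euclidean distances. Returns the agglomeration schedule:
    one (members_a, members_b, coefficient) tuple per merge."""
    clusters = {i: [i] for i in range(len(X))}

    def linkage(a, b):  # mean pairwise squared distance between two clusters
        return float(np.mean([np.sum((X[i] - X[j]) ** 2)
                              for i, j in product(clusters[a], clusters[b])]))

    schedule, next_id = [], len(X)
    while len(clusters) > 1:
        keys = list(clusters)
        # merge the pair of clusters with the smallest linkage distance
        a, b = min(((p, q) for i, p in enumerate(keys) for q in keys[i + 1:]),
                   key=lambda pair: linkage(*pair))
        schedule.append((clusters[a][:], clusters[b][:], linkage(a, b)))
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return schedule

# Hypothetical (valence, arousal, impact) means for six clips:
X = np.array([[7.8, 2.5, 6.0], [7.5, 2.8, 5.5],
              [2.0, 3.2, 5.0], [2.3, 3.5, 4.8],
              [5.0, 6.5, 2.0], [5.2, 6.8, 2.2]])
coeffs = [c for _, _, c in agglomerate(X)]
# Cut before the first large jump in the coefficients:
jumps = np.diff(coeffs)
n_clusters = len(X) - (int(np.argmax(jumps)) + 1)
```

For these toy data the first three merges join near-identical pairs (coefficients well under 1), after which the coefficient jumps sharply, so the cut-off yields three clusters.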

Familiarity
The familiarity scores of one participant were excluded because the participant misunderstood the instructions, and responded "yes" if the clips looked similar to previously shown clips. On average, participants had previously seen 4.93 of 291 clips (SD = 9.66; range = 0-96). Most participants (n = 143) had seen fewer than 5% of the clips before the study. Of the remaining 10 participants: 7 had seen 5-9%, 2 had seen 10-12%, and 1 had seen 33% of the clips. The familiarity scores were also averaged across all participants for each clip to determine whether certain clips were more familiar than others. On average, each clip had been seen by 2.59 participants (SD = 2.64; range = 0-15). Most clips (n = 273) had been seen by fewer than 5% of participants, demonstrating that the clips were generally unfamiliar. Of the remaining 18 clips: 10 had been seen by 5-6% of participants, 7 by 7-8% of participants, and 1 by 9.8% of participants. Note that in the attached spreadsheet containing these data, 0 = "unfamiliar" and 1 = "familiar."

Valence, Arousal, and Impact Ratings
The mean valence, arousal, and impact ratings for each video clip are provided in the DEVO spreadsheet. The mean valence rating was 4.92 (SD = 1.61), with scores ranging from 1.73 to 8.29. The mean arousal rating was 4.87 (SD = 1.16), with scores ranging from 2.17 to 7.50. The mean impact rating was 3.91 (SD = 1.28), with scores ranging from 1.59 to 7.09. The range in scores for arousal and impact (5.33 and 5.50 points, respectively) was somewhat smaller than the range for valence (6.56 points). See the DEVO spreadsheet for the mean ratings for men and women and the standard deviations. Relations between the ratings of valence, arousal, and impact can be seen in Figures 1 to 7.

K-means Clustering
The k-means analysis required 8 iterations to reveal a stable cluster solution. This resulted in three clusters that differed in valence, arousal, and impact (Table 3). Based on the mean and range of valence scores, the first cluster included more negative clips, the second more positive clips, and the third more neutral clips. Arousal and impact were highest for the 'negative' cluster, lower for the 'positive' cluster, and lowest for the 'neutral' cluster.
Ratings of valence and arousal followed the typical inverted-U-shaped relationship (Figure 1), with valence and impact showing the typical U-shaped pattern (Figure 2). This was expected because the arousal scale was inverted, with lower scores reflecting higher levels of arousal (the same pattern held when the data for men and women were examined separately). A strong linear relationship between arousal and impact scores can be seen in Figure 3. Although most cases followed the typical pattern, three clips from the 'negative' cluster stand out: they appear more neutral than the rest of the cluster, yet are more arousing and impactful than other neutral clips. This, in addition to the particularly variable 'positive' cluster, suggests that there may be even smaller groupings of clips in the database. The hierarchical cluster analysis was thus performed to uncover whether smaller, yet meaningful, clusters exist.

Men and women
The k-means clustering solutions for men and women were very similar, forming three clusters of negative, positive, and neutral clips (Table 4). Once again, arousal and impact were highest for the negative cluster, lower for the positive cluster, and lowest for the neutral cluster. More clips fell into the neutral cluster for men than for women (138 vs. 95 clips, respectively), due to some clips being considered negative by women but more neutral by men. Compared to men, women rated the videos higher in arousal and impact (see Figure 4).

Hierarchical Clustering
The clustering process was stopped after stage 284 (see the clustering script in Appendix A and the agglomeration schedule in Appendix B), because there was a 1.48 increase in the agglomeration coefficients at that point, whereas the increase up until then was between 0.2 and 0.5. The large increase in the coefficient suggested that the clips being clustered together after that point were more heterogeneous than previously grouped clips. This eliminated the last six stages in the process, resulting in a total of seven clusters (because all clips are merged into one cluster at the end, removing each stage separates one cluster from a larger cluster each time). The self-report ratings for the seven clusters are summarized in Table 5. The same data points as above are displayed in Figures 5 to 7, this time illustrating the division into seven clusters. The clusters contained between 3 and 88 clips each. The mean and range of scores were examined in order to provide a meaningful description for each cluster. Because each rating was given on a 9-point Likert scale: a) valence scores between 1 and 3.5 were considered positive, 3.5 to 6.5 neutral, and 6.5 to 9 negative; b) arousal scores between 1 and 3.5 were considered high, 3.5 to 6.5 medium, and 6.5 to 9 low; c) impact scores between 1 and 3.5 were considered low, 3.5 to 6.5 medium, and 6.5 to 9 high. Based on the valence ratings, three clusters were identified as neutral, one cluster as positive, one cluster as negative, and two clusters with varying degrees of positive-to-neutral and negative-to-neutral scores (the greater variability in valence in the latter two clusters may have resulted from their large number of clips).
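These cut-offs can be expressed as a small labeling helper (a sketch in Python; the function name and the example cluster mean are ours for illustration, and we treat the band boundaries as half-open at 3.5 and 6.5):

```python
def describe_cluster(valence, arousal, impact):
    """Map a cluster's mean ratings onto the verbal labels used in the
    text (Study 1 scales: valence 1 = happy, arousal 1 = excited)."""
    def band(score, low, mid, high):
        if score < 3.5:
            return low
        return mid if score < 6.5 else high
    return (band(valence, "positive", "neutral", "negative"),
            band(arousal, "high", "medium", "low"),  # 1 = excited, so low scores mean high arousal
            band(impact, "low", "medium", "high"))

# A hypothetical cluster mean of (7.2, 3.1, 5.4) would be labelled:
print(describe_cluster(7.2, 3.1, 5.4))  # ('negative', 'high', 'medium')
```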
The three neutral clusters differed in their mean ratings of arousal and impact: Clusters 3 and 6 were both medium arousal, but the second was more impactful than the first, whereas cluster 7 was just as impactful as cluster 6 but much higher in arousal. In fact, the neutral cluster 7 contained only three videos, the same three that were described as unrepresentative of the negative cluster in the k-means analysis. When comparing the two groups of positive videos, cluster 2 was more positive, arousing, and impactful than the positive-to-neutral cluster 1. Similarly, the negative cluster 5 was more negative, arousing, and impactful on average than the negative-to-neutral cluster 4. The ratings of arousal and impact did not appear to differently influence the positive and negative cluster descriptions overall, suggesting that perhaps we only need to consider one of the two dimensions. To follow up, the same cluster analysis was performed using only two dimensions at a time (valence and arousal, or valence and impact) to determine whether the same clustering solution would appear when considering only arousal or impact. This greatly changed the clustering of clips by entirely eliminating either the neutral, medium arousal/impact cluster or the positive, medium arousal/impact cluster, and both solutions rendered the already variable positive-to-neutral cluster larger. Although arousal and impact were highly correlated and appeared to influence the emotional clusters to the same extent, both were crucial and necessary for generating the seven meaningful clusters.

Figure 5:
Mean valence (1 = happy; 9 = unhappy) and arousal (1 = excited; 9 = calm) ratings per video clip from Study 1, organized in seven clusters based on the hierarchical cluster analysis.

Study 2
After we completed Study 1, reviewers made several suggestions, including that we collect more rating data, collect the three emotion ratings (i.e., valence, arousal, and impact) within-subjects, reverse the arousal rating scale, and collect data online. We made these changes and collected a new set of ratings of the full set of video clips from undergraduate students at the same university as in Study 1. Our primary goal was to determine the reliability of the ratings collected in Study 1 (collapsing men and women together).

Methods
Participants
124 adults (94 women, 30 men; mean age = 19.36 years ± 1.99 SD) were each presented with all 291 video clips for rating (we removed six additional participants: five were duplicate cases, and one had rated valence as "5" for almost all of the stimuli). Similar to Study 1, participants were recruited from the University of Ottawa undergraduate research pool, and given partial course credit for their participation. The study was approved by the University of Ottawa Research Ethics Board (#H08-14-25). All participants provided informed consent at the outset.

Procedures
We collected ratings of each video clip online, using Qualtrics (https://www.qualtrics.com), and participants made their ratings outside the laboratory. We encouraged participants to take breaks if they wished. The 291 video clips were presented one at a time (in a separate random order for each participant), with a rating of impact, arousal, and valence (always in that order) made for each video after it was shown. The rating scales and endpoints were the same as in Study 1, with the exception that we reversed the scale for arousal to make 1 = calm and 9 = excited (so that it would match the direction of the impact scale). We kept the direction of the valence scale the same as in Study 1, for the sake of internal replication. Participants were then asked to rate familiarity by indicating whether they had seen that particular video before (answering "yes" or "no" for each).
To help participants, we presented three example video clips (resembling the ones from the database) at the beginning of the study, and had participants make ratings of each example video's impact, arousal, valence, and familiarity. The responses to these questions were not used for scoring, but rather to help the participant to become comfortable with the process. In addition, in our debriefing at the end of the study we asked: "We understand that when completing online studies, there are sometimes unavoidable distractions or interruptions. Please tell us if you took any breaks or were distracted/interrupted during the study (for how long, during which part, source of distraction, etc.)". This was to allow participants to reflect on their participation and share whether they had experienced any distractions that would make their responses unreliable.

Results
At the outset, we make two notes: First, many participants reported on debriefing that they had taken one or more breaks during the ratings, sometimes to handle a distraction (e.g., an incoming text or phone call). A handful reported that they had rated the videos while also performing an additional task (e.g., watching television, waiting for clients at work, or attending a lecture[!]), but this is probably to be expected.
Second, whereas in Study 1 all participants provided ratings of all of the videos they were shown (i.e., they did not "skip" any questions), in Study 2 no question received a 100% response rate. The response rates per video were, for arousal (M = 50%; SD = 9%; range = 30-73%), impact (M = 52%; SD = 10%; range = 26-76%), valence (M = 70%; SD = 7%; range = 46-82%), and familiarity (M = 79%; SD = 2%; range = 74-84%). Several participants indicated in the online debriefing notes that they thought the survey was too long, and they may have skipped questions to get to the end faster. The lower response rates for arousal and impact compared to valence and familiarity may be important, but we have no clear idea how. Only five participants indicated any sort of technical problem with the video presentation (e.g., "The last videos that [I] skipped wouldn't load"; "Some clips in the last quarter of the study wouldn't play"; "There are some videos that didn't work"), which could have been a problem on their particular devices. However, given the similarities between these data and those from Study 1 outlined below, we are confident that these data corroborate, and provide evidence for the validity of, the ratings that we collected in Study 1.

Familiarity
Focusing on the video clips: on average, each video clip had previously been seen by 2.5% of participants (range = 0-16.3%). 199 clips were familiar to 5% or fewer of participants, 54 clips to between 5% and 10%, and 7 clips to more than 10% (2 of those 7 were familiar to more than 15%).

Valence, Arousal, and Impact Ratings
The mean valence, arousal, and impact ratings for each video clip are provided in the DEVO spreadsheet. The video clips had an average valence rating of 5.13 (SD = 1.15; range = 2.81-8.23), an average arousal rating of 4.19 (SD = 0.86; range = 2.30-6.40), and an average impact rating of 3.98 (SD = 0.94; range = 2.11-6.89). The average standard deviations (across clips) for arousal, impact, and valence were 2.01, 2.06, and 1.26, respectively. Relations between the three rating variables can be seen in Figures 8 to 10.

Hierarchical Clustering
We chose to stop the clustering process after the 286th step. After this step, there was a large increase in the coefficient in the agglomeration schedule, which can be found in Appendix B. Prior to the 286th step, the coefficient increased in relatively small increments (ranging from a 0.01 to 0.33 increase). Figures 8, 9 and 10 show the five clusters from the hierarchical cluster analysis of the Study 2 ratings.
The five clusters varied in valence and in arousal/impact. The cluster sizes, means, standard deviations, ranges, and descriptions for each cluster can be found in Table 6. The number of videos in each cluster varied between 6 and 144. The means and ranges were taken into account in order to describe the clusters in a meaningful way. As in Study 1, we considered a valence rating between 1 and 3.5 as positive, a rating between 3.5 and 6.5 as neutral, and a rating between 6.5 and 9 as negative. For arousal and impact, we considered a rating between 1 and 3.5 as low, a rating between 3.5 and 6.5 as medium, and a rating between 6.5 and 9 as high. We found five meaningful clusters, which varied in valence: positive, positive-to-neutral, neutral, negative-to-neutral, and negative. No cluster had high levels of arousal or impact. Arousal and impact did not seem to differentially influence the clusters formed, perhaps because they were highly correlated (as shown in Figure 10). The largest cluster had 144 videos with positive-to-neutral valence, low impact, and low arousal: These clips included videos of human activities (e.g., socializing, eating, dancing, kissing) and videos of animals. The smallest cluster had 6 videos with neutral valence, medium-high arousal, and medium impact: These clips included ski cliff jumping, a shark biting an underwater camera, and parachute jumping.
The cluster with negative valence also had medium arousal and medium impact, including clips of violence between animals, hunting, human death, and pollution. The cluster with positive valence had medium-low arousal and medium-low impact, including clips of marriages, baby animals, hockey shootouts, desserts, and family relationships. Lastly, the negative-to-neutral valence cluster had medium-low arousal and medium-low impact, including clips of animals (e.g., wasps, snakes), people crying, and rushing water.

Spearman correlation comparing in-person and web-based ratings
Reassuringly, the ratings from the Study 2 (i.e., web-based) participants were similar in rank-order to those from the Study 1 (i.e., in-person) participants: Spearman correlations were ρ = -0.90 for arousal (negative because the direction of the arousal scale was reversed in Study 2), ρ = 0.93 for impact, and ρ = 0.98 for valence. Figures 11 to 13 show these correlations across the two sets of raters (i.e., Studies 1 and 2).
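The negative sign of the arousal correlation is simply a consequence of the reversed scale direction. A minimal sketch (with hypothetical per-clip means, not the actual DEVO ratings) shows how reversing a 1-9 scale flips Spearman's rho:

```python
import numpy as np

def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the ranks
    (this simple ranking assumes no tied values)."""
    rank = lambda v: np.argsort(np.argsort(v))
    return float(np.corrcoef(rank(x), rank(y))[0, 1])

# Hypothetical per-clip arousal means for five clips:
study1 = np.array([2.2, 3.0, 4.5, 5.1, 6.8])  # Study 1: 1 = excited, 9 = calm
study2 = np.array([7.5, 6.9, 5.0, 4.4, 2.6])  # Study 2: 1 = calm, 9 = excited

print(spearman(study1, study2))       # -1.0 (scales run in opposite directions)
print(spearman(study1, 10 - study2))  # 1.0 once Study 2 is re-reversed
```

Re-reversing one scale (here as 10 minus the rating) recovers the positive association, so the magnitude of rho, not its sign, is what indexes agreement across the two studies.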

General Discussion
The purpose of this study was to create a set of 291 brief and unfamiliar video clips to complement existing libraries. We are making these clips available to colleagues, and would suggest local exploration and validation before using them. Nonetheless, as a starting point we have provided basic subjective ratings from young Canadians. Participants reported whether they had seen each video prior to the experiment, which indicated that the videos were generally unfamiliar. They also rated their feelings of emotional valence, arousal, or impact after each clip. As expected, there was a wide range of positive, negative, and neutral clips with varying degrees of arousal and impact. In general, positive and negative clips were more arousing and impactful than neutral clips. A closer examination of the underlying structure of the clips was provided by the k-means and hierarchical cluster analyses.

Arousal versus Impact
Researchers are beginning to explore the different effects of arousal and impact on emotion processing. Until recently, arousal garnered much of the research focus, especially when pertaining to emotional memory (e.g., McGaugh, 2000, 2015; McGaugh et al., 1993). Yet, some work suggests that when arousal is controlled, impact influences visual attention allocation (Murphy, Hill, Ramponi, Calder, & Barnard, 2010) and amygdala activation when viewing negative images (Ewbank, Barnard, Croucher, Ramponi, & Calder, 2009). In fact, impact may even be a better predictor of recognition memory than arousal, despite both dimensions being significantly correlated (Croucher et al., 2011). In the present set of ratings, arousal and impact were highly correlated and did not appear to differentially influence the positive, negative, and neutral clusters generated by the k-means analysis in Study 1. The patterns of arousal and impact also seemed similar to one another in the hierarchical cluster analysis from Study 2. However, a number of subtle differences can be seen between arousal and impact, for example when closely examining the seven clusters from the hierarchical cluster analysis in Study 1. Despite their significant overlap, both arousal and impact contributed independently to the underlying structure of the clusters, because the removal of either rating reduced the number and homogeneity of clusters. The structure of the clips in the database therefore depended on both arousal and impact, in addition to valence. Moreover, impact differentiated between two neutral, medium-arousal clusters, whereas arousal differentiated between two neutral, medium-impact clusters.
A final difference was observed in the extremity of scores: on average, impact scores were less extreme than arousal scores. Only 9 video clips had an impact score above 6.5 (indicating high impact), whereas 38 clips had an arousal score below 3.5 (indicating high arousal, given the reversed scale). In fact, the highest average impact for a single clip was 7.09, compared to the lowest (i.e., most arousing) average arousal value of 2.17. This may explain why there were no clusters with high impact despite there being some clusters high in arousal (we note, however, that the average impact of clusters did increase gradually in parallel with arousal, yet simply did not surpass our absolute threshold for high impact). The difference in extremity of scores may also reflect subtle differences in the scales being used. Whereas the impact scale is clearly linear from no impact (1) to intensive impact (9), arousal is measured from excited (1) to calm (9), where the midpoint (5) represents a state that is neither calm nor excited (as per the IAPS instructions; Lang et al., 2008). More linear measurements of arousal (e.g., the affective slider from Betella & Verschure, 2016) could be useful for future work directly contrasting the roles of arousal and impact in emotion processing. In addition, a reviewer who explored the Study 1 ratings with a simultaneous regression predicting valence noted that impact, but not arousal, had an effect.
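Because the arousal scale is reversed relative to the impact scale, users of the ratings may want to recode it before applying thresholds. Below is a minimal sketch; the function names are our own, and the 3.5/6.5 cutoffs simply mirror the illustrative thresholds discussed above rather than anything built into the database:

```python
def is_high_arousal(raw_arousal, threshold=3.5):
    """On the reversed SAM scale used here (1 = excited, 9 = calm),
    scores *below* the threshold indicate high arousal."""
    return raw_arousal < threshold

def to_conventional_arousal(raw_arousal):
    """Recode to a conventional scale where higher = more aroused."""
    return 10 - raw_arousal

def is_high_impact(impact, threshold=6.5):
    """The impact scale is already linear: 1 = no impact, 9 = intensive impact."""
    return impact > threshold
```

For example, the most arousing clip in the set (average raw arousal of 2.17) recodes to 7.83 on the conventional low-to-high scale.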

Cluster Analyses
We performed a three-cluster k-means analysis on the ratings from Study 1 because many researchers select stimuli based on one of three levels of emotional valence (positive, negative, neutral). This revealed a cluster of negative clips high in arousal/impact, a cluster of positive clips slightly lower in arousal/impact, and a cluster of neutral clips lower still in arousal/impact. Despite the utility of separating clips into these three groups, smaller subgroups of clips may be more meaningful. The range of valence scores for each cluster in the k-means solution was very wide, encompassing both emotional and neutral clips (negative cluster range = 4.48–8.29; positive cluster range = 1.73–5.02). Furthermore, three neutral clips were sorted into the negative cluster due to their increased impact/arousal, but they remained very different in terms of their valence, which is problematic for researchers selecting stimuli based primarily on valence. Without knowing the underlying number of clusters in the database, we performed a hierarchical cluster analysis, which identifies subgroups of clips that are similar to one another and maximally dissimilar from clips in other subgroups. This generated seven clusters of clips: three neutral, two negative or negative-to-neutral, and two positive or positive-to-neutral. Unique to this database, there were three types of neutral videos varying from low to medium impact and low to high arousal. These videos will be particularly useful for researchers who aim to study valence independent of arousal. Arousal has been exceedingly challenging to control when using neutral pictures (e.g., of a blue mug or a buffalo) or word stimuli (e.g., door, table, shoe) that do not readily capture participants' attention and are likely to quickly induce habituation.
Videos, on the other hand, readily capture attention (Rauschenberger, 2003; Theeuwes, 2010) and boost physiological arousal more easily than pictures (Detenber & Simons, 1998). This will allow for a greater distinction between the contributions of valence and arousal to cognitive processing in future research. In addition, the hierarchical cluster analysis in Study 1 revealed a group of 25 positive videos that evoked medium levels of arousal and impact, which will provide a useful comparison for medium arousing/impactful negative videos. Moreover, both positive clusters and the negative-to-neutral cluster were also similar in terms of arousal and impact scores. However, the hierarchical analysis also revealed a unique 23-item cluster of negative videos that were more arousing and impactful than nearly all other clips. This cluster included scenes of violence toward humans (gun shooting, group fighting) and animals (whale hunting, an animal carcass), a school bus drifting in a flood, animal farming, a man throwing up, and a man injecting drugs. This speaks to the variety of themes that can evoke intense negative feelings. Yet, it remains challenging to reliably evoke intense positive feelings, even when using videos. Here we included a variety of positive themes (animals, surfing, hockey shootouts, babies), but these were not as arousing or impactful as the negative videos. We also included scenes of intimate kissing with limited nudity, and although these evoked high levels of arousal, the ratings of valence were variable. As a result, this video database will be most useful for researchers comparing the effects of low-to-medium levels of arousal and impact between positive, negative, and neutral stimuli.
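To make the clustering step concrete, the three-cluster k-means analysis described above can be sketched in a few lines of pure Python. The toy (valence, arousal, impact) ratings and the initial centroids below are illustrative values on the 1–9 scales, not actual DEVO ratings (recall that low raw arousal scores indicate high arousal on the reversed scale):

```python
import math

def kmeans(points, init_centroids, iters=50):
    """Plain Lloyd's algorithm for small 3-D rating vectors."""
    centroids = [list(c) for c in init_centroids]
    assignments = [0] * len(points)
    for _ in range(iters):
        # Assign each clip to the nearest centroid (Euclidean distance).
        for i, p in enumerate(points):
            assignments[i] = min(
                range(len(centroids)),
                key=lambda k: math.dist(p, centroids[k]),
            )
        # Move each centroid to the mean of its assigned clips.
        for k in range(len(centroids)):
            members = [p for i, p in enumerate(points) if assignments[i] == k]
            if members:
                centroids[k] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignments, centroids

# Toy mean ratings (valence, arousal, impact); low raw arousal = high arousal.
clips = [
    (7.8, 2.5, 6.0),  # negative, high arousal/impact
    (8.1, 2.2, 6.4),
    (2.1, 3.5, 5.0),  # positive, somewhat lower arousal/impact
    (2.4, 3.8, 4.7),
    (5.0, 6.5, 2.0),  # neutral, low arousal/impact
    (4.9, 6.8, 1.8),
]
labels, _ = kmeans(clips, init_centroids=[clips[0], clips[2], clips[4]])
```

In practice, researchers would run this (or a library implementation) on the published mean ratings; the hierarchical solution reported above is a better guide to the database's finer structure.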

Familiarity
The novelty of experimental stimuli is vital to studies on emotion because participants' reactions may be influenced by prior exposure to a stimulus (Bransford & Johnson, 1972; Craik & Lockhart, 1972; Gabert-Quillen et al., 2015; Hannula, 2010; Hutchinson & Turk-Browne, 2012; Kuhl & Chun, 2014; Robertson & Köhler, 2007; Westmacott et al., 2004). In the present study, fewer than 2% of participants on average had seen each video clip prior to the experiment, meaning that their emotional reactions to the clips were largely uninfluenced by previous experience. This is important because it can be challenging for researchers to objectively measure and control for participants' past exposure to stimuli. Although the majority of the videos were unfamiliar, seven clips in Study 1 had been previously seen by 7–8% of participants and one had been seen by nearly 10% of participants. The most familiar clip was of a hockey shootout, which may be less familiar to non-Canadian participants. The remaining clips that were rated as somewhat familiar were of humans, animals, food, and natural disasters, collected from online sources or movies/documentaries. Interestingly, other clips selected from these same sources were not highly familiar to participants, suggesting that participants either did not remember the source videos themselves or had seen other videos highly similar to these experimental clips and mistook them as being the same. We should point out a possible influence that might have inflated the familiarity ratings for some clips: even though each participant was shown each clip only once (and was told this would be the case), some participants may have erroneously thought that they had seen some of the clips before.
This came up in the debriefing notes from Study 2 and may have to do with the fact that some clips were very similar to one another (for example, the set includes two separate clips of Nashville Predators hockey players scoring on the goalie in a shootout practice, and both had high familiarity ratings). This is a potential hazard of presenting any participant with more than one clip from a set. Thus, if anything, the overall familiarity of the clips in this set is likely slightly lower than indicated. To disentangle these responses in future work, participants could also rate how often they view scenes similar to the ones presented, on a continuous scale from never to very often (Libkuman, Otani, Kern, Viger, & Novak, 2007).

Applications of the Database
These video clips were selected to increase the affective realism (Camerer & Mobbs, 2017; Zaki & Ochsner, 2012) and ecological validity of emotion-eliciting stimuli, without having to, for example, present live tarantulas and spiders during an experiment. They should provide a useful alternative to static pictures, which may be less arousing for participants today than they were two decades ago (Libkuman et al., 2007), perhaps due to increased exposure to visual material on the internet. Including a large number of video clips that are controlled on various factors (e.g., duration, source, levels of arousal/impact) will reduce confounds when comparing emotional and neutral stimuli.
There may be various other factors that researchers consider important when selecting stimuli from this set. A reviewer noted that the relation between means and standard deviations for each video may vary across the dimensions of valence, arousal, and impact. For instance, for valence, raters tended to converge on neutral ratings: if a video seemed neutral in valence, most raters agreed on this. In contrast, for impact, raters agreed on which videos were "low impact" (yielding small standard deviations for low mean ratings), but as the mean rating of a video increased, so did the diversity of ratings (yielding larger standard deviations for higher mean ratings).
This novel database of video clips joins existing film libraries. To date, close to 300 film clips have been used across dozens of studies (for a list of existing film clips, see Gilman et al., 2017). The current study (with 291 videos) doubles the number of clips available to researchers. Note that the present clips differ from the existing sets in several ways. First, the present clips are shorter than most of those in the existing literature. We selected relatively short clips because our primary aim was to use them as targets in future memory research, analogous to how the IAPS pictures have been used (e.g., Bradley, 2014). Second, unlike most existing clips, the present ones are low in familiarity (as mentioned in the previous paragraph). Again, this will be a major advantage for studies of attention and memory, because familiarity can influence participants' emotional responses (Gabert-Quillen et al., 2015), eye movements (Hannula, 2010), attention (Hutchinson & Turk-Browne, 2012; Kuhl & Chun, 2014), and declarative memory (Bransford & Johnson, 1972; Craik & Lockhart, 1972; Robertson & Köhler, 2007; Westmacott, Black, Freedman, & Moscovitch, 2004). In fact, with the similar set of video clips from Frederiks et al. (2019), researchers will have two compatible sets from which to choose individual stimuli.
These clips will be particularly useful for studies on emotional memory. First, videos capture attention (Rauschenberger, 2003; Theeuwes, 2010) and increase physiological arousal (Detenber & Simons, 1998) more easily than static images, which may intensify participants' emotional reactions. Second, videos are often better remembered than static images, a phenomenon referred to as the dynamic superiority effect (Buratto, Matthews, & Lamberts, 2009), because of increased attention at encoding or their greater complexity and conceptual richness (Candan et al., 2016). Using videos as study material will greatly reduce floor effects in memory studies and will also allow researchers to employ longer study-test intervals. These behavioural effects can be studied in parallel with event-related potentials, which depend on a large number of well-controlled stimuli.
In the present paper, we provide the average ratings of valence, arousal, and impact of the video clips and also discuss a meaningful way to organize the clips based on cluster analysis. When selecting stimuli for an experiment, most researchers use specific cutoff values to divide each dimension into different categories (e.g., positive, negative, or neutral valence; high, medium, or low arousal). This leads to two problems. First, researchers do not employ the same cutoff values, limiting comparisons between studies (compare the different thresholds used when selecting stimuli from IAPS in Mikels et al., 2005 and Xing & Isaacowitz, 2006). Second, applying specific cutoff values means that researchers must rely on only one or two dimensions at a time while ignoring all other, possibly relevant, dimensions (Constantinescu et al., 2016). This practice also ignores whether the absolute thresholds applied to one or two dimensions approximate the internal structure of the stimuli in the database, as there may exist subgroups of stimuli that go beyond the classifications of the selected dimensions (Constantinescu et al., 2016). In the present paper, we describe the internal structure of the video clip database taking all three dimensions into account simultaneously. This will ensure that researchers compare groups of video clips that are maximally different on all three dimensions, which will be essential for developing a more accurate understanding of the relative influences of impact and arousal on emotion processing.

User's Manual
Summary information for the 291 video clips is provided in Table 2. In the DEVO spreadsheet, we provide the following information: 1) clip number (referring to the source and the specific clip from that source); 2) duration in milliseconds; 3) source title; 4) original source format (including a URL when available); 5) source release date; 6) brief description; 7) presence of people; 8) presence of animals or insects; 9) presence of food, drink, or drugs; 10) number and percent of participants that had previously seen the video; 11) cluster assignment based on the hierarchical cluster analysis; and 12) the means and standard deviations of valence, arousal, and impact for men, women, and both men and women together.
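Once the spreadsheet is exported to CSV, selecting clips by their hierarchical cluster assignment is straightforward. Here is a sketch using only Python's standard library; the column names (`clip`, `cluster`, and so on) and all values are hypothetical and should be replaced with the actual headers and data in the DEVO spreadsheet:

```python
import csv
import io

# Hypothetical excerpt of the DEVO spreadsheet exported as CSV; the
# real column names may differ, so inspect the file header first.
sheet = """clip,duration_ms,cluster,valence_mean,arousal_mean,impact_mean
clip_001,5400,1,7.8,2.5,6.0
clip_002,4200,5,5.0,6.5,2.0
clip_003,6100,1,8.1,2.3,6.2
"""

def clips_in_cluster(csv_text, cluster):
    """Return clip identifiers assigned to a given hierarchical cluster."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [row["clip"] for row in rows if int(row["cluster"]) == cluster]

selected = clips_in_cluster(sheet, 1)
```

Filtering on cluster assignment rather than on ad hoc cutoffs keeps the selection consistent with the three-dimensional structure described above.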
The video clips can be obtained directly from the corresponding author. There are two versions of each clip: one with the original sound and one with the sound removed. The current set of ratings was collected for the clips without sound. We removed the sound to make the clips uniform: some of the source clips had no sound, some had dialogue in English, some had dialogue in another language, some had sound effects, and some had background music. However, we also provide the original versions, which often contained dialogue (in English, French, or other languages), sounds of nature, or background music. Emotional reactions to the clips with sound should be similar to, if not more extreme than, what is reported here for the silent clips, but researchers should collect their own ratings prior to using them. More generally, we encourage researchers to check the ratings locally before using any of the video clips, to ensure that the ratings are consistent cross-culturally.
Finally, it is important to ensure that the appropriate codecs are installed on the computers that will be used during an experiment (the H.264 codec for the .avi files, and the Windows Media Video and Audio Professional codecs for the .wmv files). If using E-Prime software to play the silent videos, you may select an option within the codec configuration called "skip audio" that allows the silent videos to play without an audio load error. For further help with codec configuration in E-Prime, visit: http://www.pstnet.com/support/kb.asp?TopicID=3162.

Conclusion
The Database of Emotional Videos from Ottawa provides researchers with a large set of ecologically valid and naturalistic videos suitable for emotion research. We provide standardized ratings of valence, arousal, and impact and recommend selecting videos based on all three dimensions, using the subgroups identified through hierarchical cluster analysis as a guide. Due to the wide range in arousal and impact ratings for both emotional and neutral videos, researchers will be able to better control for the differences between emotional and neutral stimuli when examining the effects of emotion on cognition. With its wide range of themes, future work could also examine the discrete emotions specifically elicited by each video.

[Table: Hierarchical Cluster Analysis from Study 2, Agglomeration Schedule (coefficients and the stage at which each cluster first appears).]

Appendix C Rating Instructions
Valence After viewing each clip, use the following scale to indicate how you felt in reaction to it on a happy vs. unhappy scale. On one extreme of the scale (1) you felt happy, pleased, satisfied, contented, hopeful, while viewing the video; whereas on the other end of the scale (9) you felt completely unhappy, annoyed, unsatisfied, melancholic, despaired, bored. If you felt completely neutral, neither happy nor unhappy, you can respond 5. Please provide the most accurate rating of how you felt in response to the video using any number between 1 and 9.