The project repository can be found here.
Participles in English can form attributive clauses such as "the boy playing with the kitten" or "the people involved". Typically, single participles stand before the noun, while participial clauses stand after it, though there are exceptions such as "the people involved". In my project I want to:
Check and describe the position of participial clauses in native English with regard to the noun.
Compare it to the use of participial clauses in learner English.
Detect the conditions in learner English under which a participle or a participial clause is most frequently placed in the wrong position.
Categorical input: voice, verb lemma, head noun lemma. Numeric input: length of clause in words, distance from the head noun.
Dependent variables: position, type.
Hypotheses:
Language learners are likely to put the construction in the wrong position, e.g. "a dressed in black man" (correct: "a man dressed in black") or "the involved people" (correct: "the people involved"), due to interference from their native language.
The formation of a relative clause instead of a participial construction depends on the length of the clause (the longer the unit is, the more likely it is to be used as a relative clause).
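One way the second hypothesis could be tested is a logistic regression of clause type on clause length. The sketch below is only an illustration under stated assumptions: the data are made up, and `is_relative` is a hypothetical 0/1 coding of the `type` variable (1 for a relative clause, 0 for a participial construction).

```r
# Made-up data: 10 clauses at each length from 1 to 20. The number of
# relative clauses grows with length (this pattern is assumed, not observed).
lengths <- rep(1:20, each = 10)
n_relative <- round(10 * plogis(-3 + 0.3 * (1:20)))
is_relative <- unlist(lapply(n_relative, function(k) c(rep(1, k), rep(0, 10 - k))))
toy <- data.frame(length = lengths, is_relative = is_relative)

# Logistic regression: does the chance of a relative clause rise with length?
fit <- glm(is_relative ~ length, data = toy, family = binomial)

# A positive coefficient for length would be in line with the hypothesis.
coef(fit)["length"]
```

On the real data the same call would use the actual clause lengths and the recoded `type` column in place of the toy values.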
I have chosen BNC and REALEC as my sources of data. BNC represents L1 (native language) data, and REALEC represents L2 (learner language) data.
For my research I need to construct 2 datasets:
First I need to collect all the cases when a participle or gerund is used in an attributive clause. The steps I have taken to do this include the following (workflow is identical for both BNC and REALEC):
As a result, my dataframe for BNC consisted of 170 883 observations, and the dataframe for REALEC consisted of 15 235 observations.
I decided to work with a fraction of these data, because working with such a large dataframe in R would be too slow, and also because the comparison of L1 and L2 data is fairer if the number of observations is identical. Thus, I randomly chose 10 000 observations from each corpus, BNC and REALEC.
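The subsampling step could look roughly like the sketch below. It is not the code used in the project: the helper `sample_rows`, the seed, and the toy data frame `full_data` are illustrative assumptions.

```r
# Hypothetical helper: draw n rows at random, without replacement.
sample_rows <- function(df, n, seed = 42) {
  stopifnot(n <= nrow(df))
  set.seed(seed)  # a fixed seed makes the sample reproducible
  df[sample(nrow(df), n), , drop = FALSE]
}

# Toy stand-in for the full extraction (made-up values, not project data).
full_data <- data.frame(
  id = 1:50,
  position = rep(c("after", "before"), times = 25)
)

subsample <- sample_rows(full_data, 10)
nrow(subsample)
```

For the real dataframes the second argument would be 10 000, applied once to the BNC extraction and once to the REALEC extraction.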
Let's look at the data as a proof of concept for this research:
In [1]:
options(repr.plot.width=4, repr.plot.height=3)
library(tidyverse)
library(ggplot2)
In [2]:
bnc <- read.csv("BNC_data_sample.csv", sep='\t')
In [4]:
head(bnc, n = 10)
In [55]:
summary(bnc) # summary is not very informative, but just for fun...
In [3]:
# Counts and bar plot for BNC
bnc_num_after <- nrow(filter(bnc, position == 'after'))
bnc_num_before <- nrow(filter(bnc, position == 'before'))
bnc_position <- filter(bnc, position == 'after' | position == 'before')
print(bnc_num_after)
print(bnc_num_before)
plot1 <- ggplot(data = bnc_position, aes(position)) +
  geom_bar(stat = "count") +
  ggtitle("Position of attributive participial construction with\nregard to the head noun in BNC") +
  theme(plot.title = element_text(size = 10)) +
  ylab("number of cases") +
  xlab("position")
#plot1
As we can see, in the L1 data there are more participial clauses that stand after the head noun than before it. What about the L2 data?
In [4]:
realec <- read.csv("REALEC_data_sample.csv", sep='\t')
In [24]:
# Counts and bar plot for REALEC
real_num_after <- nrow(filter(realec, position == 'after'))
real_num_before <- nrow(filter(realec, position == 'before'))
realec_position <- filter(realec, position == 'after' | position == 'before')
print(real_num_after)
print(real_num_before)
plot2 <- ggplot(data = realec_position, aes(position)) +
  geom_bar(stat = "count") +
  ggtitle("Position of attributive participial construction with\nregard to the head noun in REALEC") +
  theme(plot.title = element_text(size = 10)) +
  ylab("number of cases") +
  xlab("position")
#plot2
In the L2 data, too, clauses standing after the head noun are more frequent. Let's plot the data from the two corpora together for greater clarity.
In [6]:
# Label each row with its corpus, then stack the two data frames
combined1 <- bnc_position
combined1$corpus <- 'BNC'
combined2 <- realec_position
combined2$corpus <- 'REALEC'
combined <- rbind(combined1, combined2)
plot3 <- ggplot(data = combined, aes(position, fill = corpus)) +
  geom_bar(stat = "count", position = 'dodge') +
  ggtitle("Position of attributive participial construction with\nregard to the head noun") +
  theme(plot.title = element_text(size = 10)) +
  ylab("number of cases") +
  xlab("position")
plot3
The gap between the two positions seems to be larger in REALEC than in BNC. But is this difference real? The null hypothesis is that the position of the construction does not depend on the corpus, i.e. the proportions of "after" and "before" are the same in BNC and REALEC. The alternative hypothesis is that the proportions differ.
In [63]:
position <- matrix(c(bnc_num_after, real_num_after, bnc_num_before, real_num_before), nrow = 2)
rownames(position) <- c('bnc', 'realec')
colnames(position) <- c('after', 'before')
position
In [66]:
chisq.test(position)
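To make it clearer what the chi-squared test computes, here is a small sketch that reproduces the statistic by hand. The counts in `toy` are made up for illustration and are not the BNC or REALEC counts.

```r
# Made-up counts for illustration only (rows: corpus, columns: position).
toy <- matrix(c(60, 80, 40, 20), nrow = 2,
              dimnames = list(c("bnc", "realec"), c("after", "before")))

# Expected counts under independence: row total * column total / grand total.
expected <- outer(rowSums(toy), colSums(toy)) / sum(toy)

# Pearson statistic: sum of (observed - expected)^2 / expected.
chi_sq <- sum((toy - expected)^2 / expected)

# chisq.test() applies Yates' continuity correction to 2x2 tables by default,
# so it is switched off here to match the hand computation.
builtin <- chisq.test(toy, correct = FALSE)

chi_sq
unname(builtin$statistic)
```

Note that the plain call `chisq.test(position)` on a 2x2 table uses the continuity correction, so its statistic is slightly smaller than the uncorrected one.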
OK, it seems the null hypothesis can be rejected: the distribution of positions differs between the corpora, so the investigation makes sense.
Task: Check and describe the position of participial clauses in native English with regard to the noun.
First, let's check how the length of a participial clause influences its position. We expect that in L1 data longer units (typically, all extended participial clauses) stand after the head noun, whereas single participles are placed before the noun.
In [6]:
plot4 <- ggplot(data = bnc_position, aes(position, length)) +
  geom_boxplot(outlier.alpha = 0.1) +
  geom_point(alpha = 0.1)
plot4
# Optional zoom: compute lower and upper whiskers
#ylim1 <- boxplot.stats(bnc_position$length)$stats[c(1, 5)]
# scale y limits based on ylim1
#p1 <- plot4 + coord_cartesian(ylim = ylim1 * 1.05)
#p1
The median length of clauses after the noun is higher than that of clauses before the noun. This is exactly what we expected for the BNC data, though some extended clauses are nevertheless found before the noun. Overall, the length of a clause varies widely, from 1 (a single participle) to 100 and more, so a confidence interval cannot be determined reliably. Note also the considerable number of outliers that stand before the head noun even though they are more than 10 words long.
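The medians read off the boxplot can also be computed directly. The sketch below uses base R on a toy data frame with the same column names as the notebook (`position`, `length`); the values are made up.

```r
# Toy data with the notebook's column names (values are made up).
toy <- data.frame(
  position = c("after", "after", "after", "after", "before", "before", "before"),
  length   = c(4, 7, 12, 30, 1, 1, 3)
)

# Median and interquartile range of clause length for each position.
medians <- tapply(toy$length, toy$position, median)
iqrs    <- tapply(toy$length, toy$position, IQR)

medians
iqrs
```

The median and IQR are used here because they are far less sensitive to a few very long clauses than the mean and standard deviation.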
What about REALEC data?
In [29]:
plot5 <- ggplot(data = realec_position, aes(position, length)) +
  geom_boxplot(outlier.alpha = 0.1) +
  geom_point(alpha = 0.1)
plot5
The picture in REALEC is even more consistent than it is in BNC. Once again, the majority of long clauses stand after the head noun, and all outliers (clauses that are very long) are also found after the head noun. However, as in BNC, there are some clauses of length above 10 that stand before the noun.
Now let's take a look at the interdependence of voice and position in a sentence.
In [39]:
# Counts of position x voice in REALEC, drawn as labelled bubbles
plot6 <- realec_position %>%
  filter(voice == 'passive' | voice == 'active') %>%
  group_by(position, voice) %>%
  summarise(number = n()) %>%
  ggplot(aes(position, voice, label = number)) +
  geom_point(aes(size = number), colour = "lightblue") +
  geom_text() + scale_size(range = c(10, 30)) + guides(size = "none") +
  xlab("position") + ylab("voice") +
  ggtitle("Position of a clause vs. voice of a participle (REALEC)") +
  theme(plot.title = element_text(hjust = 0, size = 9))
plot6
In [40]:
# The same plot for BNC
plot6 <- bnc_position %>%
  filter(voice == 'passive' | voice == 'active') %>%
  group_by(position, voice) %>%
  summarise(number = n()) %>%
  ggplot(aes(position, voice, label = number)) +
  geom_point(aes(size = number), colour = "lightblue") +
  geom_text() + scale_size(range = c(10, 30)) + guides(size = "none") +
  xlab("position") + ylab("voice") +
  ggtitle("Position of a clause vs. voice of a participle (BNC)") +
  theme(plot.title = element_text(hjust = 0, size = 9))
plot6
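The bubble plots show raw counts. Because the two positions differ in size, the share of each voice within a position may be easier to compare. A minimal sketch in base R, on made-up data with the notebook's column names `position` and `voice`:

```r
# Toy data (made up); only the column names follow the notebook.
toy <- data.frame(
  position = c("after", "after", "after", "after", "before", "before"),
  voice    = c("active", "passive", "passive", "passive", "active", "passive")
)

# Cross-tabulate position by voice.
counts <- table(toy$position, toy$voice)

# Share of each voice within a position (each row sums to 1).
shares <- prop.table(counts, margin = 1)

round(shares, 2)
```

The same two calls could be applied to `realec_position` and `bnc_position` after filtering to active and passive voice.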
In [11]:
# Counts of position x clause type in REALEC
plot7 <- realec_position %>%
  group_by(position, type) %>%
  summarise(number = n()) %>%
  ggplot(aes(position, type, label = number)) +
  geom_point(aes(size = number), colour = "lightblue") +
  geom_text() + scale_size(range = c(10, 30)) + guides(size = "none") +
  xlab("position") + ylab("type") +
  ggtitle("Position of a clause vs. type of a clause (REALEC)") +
  theme(plot.title = element_text(hjust = 0, size = 10))
plot7
In [27]:
realec_position %>%
  ggplot(aes(position, length, color = position)) +
  geom_jitter(width = 0.3, alpha = 0.3) +
  labs(title = "Length of a clause ~ position of a clause",
       x = "position",
       y = "length of a clause")