This post provides some quick details on using reticulate to interface Python from RStudio and, more specifically, on using the spacy library and BERT for fine-grained lexical semantic investigation. Here we present a (very cursory) usage-based/BERT-based perspective on the semantic distinction between further and farther, using example contexts extracted from the Corpus of Contemporary American English (COCA).
1 Python & reticulate set-up
The code below sets up a conda environment and installs the relevant libraries, as well as the BERT transformer, en_core_web_trf. The following should be run in the terminal.

conda create -n poly1
source activate poly1
conda install -c conda-forge spacy
python -m spacy download en_core_web_trf
conda install numpy scipy pandas
The R code below directs R to our Python environment and Python installation.
Sys.setenv(RETICULATE_PYTHON = "/home/jtimm/anaconda3/envs/m3demo/bin/python")
library(reticulate)
reticulate::use_condaenv(condaenv = "poly1",
                         conda = "/home/jtimm/anaconda3/bin/conda")
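As a quick sanity check, reticulate can report which Python it has bound to; the output should point at the poly1 environment set up above.

## confirm reticulate sees the intended Python build & environment
reticulate::py_config()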
2 COCA
The Corpus of Contemporary American English (COCA) is an absolutely lovely resource, and one of many corpora made available by the folks at BYU. Here, we utilize COCA to build a simple data set of further-farther example usages. I have copied/pasted results from COCA's online search interface; the data set includes ~500 contexts of usage per form.
library(tidyverse)

## ld = local directory holding the COCA export (defined elsewhere)
gw <- read.csv(paste0(ld, 'further-farther.csv'), sep = '\t')

## normalize: lower-case, set off punctuation, squish extra spaces
gw$sent <- tolower(gsub("([[:punct:]])", " \\1 ", gw$text))
gw$sent <- gsub("^ *|(?<= ) | *$", "", gw$sent, perl = TRUE)

## keep only contexts containing exactly one instance of the target form
gw$count <- stringr::str_count(gw$sent, 'further|farther')
gw0 <- subset(gw, count == 1)
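A quick look at what survives the filter; the form column (per the COCA export) distinguishes the two forms.

## contexts retained, per form
nrow(gw0)
table(gw0$form)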
For a nice discussion of the semantics of further-farther, see this Merriam-Webster post. The standard semantic distinction drawn between the two forms is physical versus metaphorical distance.

Some highlighting & sample data below.
fu <- '\\1 <span style="background-color:lightgreen">\\2</span> \\3'
fa <- '\\1 <span style="background-color:lightblue">\\2</span> \\3'

gw0$text <- gsub('(^.+)(further)(.+$)', fu, gw0$text, ignore.case = T)
gw0$text <- gsub('(^.+)(farther)(.+$)', fa, gw0$text, ignore.case = T)
gw0$text <- paste0('... ', gw0$text, ' ...')
set.seed(99)
gw0 %>% select(year, genre, text) %>% sample_n(10) %>%
  DT::datatable(rownames = F, escape = F,
                options = list(dom = 't',
                               pageLength = 10,
                               scrollX = TRUE))
Lastly, we identify the location of the target token within each context, i.e., its token index. Below we count the tokens preceding the target form; this count doubles as the (zero-based) index that spaCy will use downstream.
gw0$idx <- sapply(gsub(' (farther|further).*$', '', gw0$sent, ignore.case = T),
                  function(x) {
                    length(corpus::text_tokens(x)[[1]]) })
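As a quick spot-check (on the first context, with a hypothetical helper toks): since idx counts the tokens preceding the target, the form itself should sit at position idx + 1 under R's one-based indexing.

toks <- corpus::text_tokens(gw0$sent[1])[[1]]
toks[gw0$idx[1] + 1]  ## should be "further" or "farther"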
3 BERT & contextual embeddings
Using BERT and spacy for computing contextual word embeddings is actually fairly straightforward. A very nice resource for some theoretical overview, as well as a code demo with BERT/spacy, is available here.
Getting started, we pass our data set from R to Python via the r_to_py function.

df <- reticulate::r_to_py(gw0)
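The object df is now a reference to a pandas data frame living on the Python side; a quick class check from R should reflect as much (output below is what I'd expect, assuming pandas handled the conversion).

class(df)
## eg, "pandas.core.frame.DataFrame" "python.builtin.object"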
Then, from a Python console, we load the BERT transformer using spacy.

import spacy
nlp = spacy.load('en_core_web_trf')
The stretch of Python code below does all the work here. The transformer computes a 768-dimensional vector per token/sub-token comprising each context; we then extract the tensor for further/farther using the token index. The resulting data structure is matrix-like, with each instantiation of further-farther represented in 768 dimensions.
def encode(sent, index):
    doc = nlp(sent.lower())
    ## word-piece positions aligned to the spaCy token at `index`
    tensor_ix = doc._.trf_data.align[index].data.flatten()
    out_dim = doc._.trf_data.tensors[0].shape[-1]
    tensor = doc._.trf_data.tensors[0].reshape(-1, out_dim)[tensor_ix]
    ## average over word-pieces -> a single 768-d vector
    return tensor.mean(axis=0)

r.df["emb"] = r.df[["sent", "idx"]].apply(lambda x: encode(x[0], x[1]), axis=1)
4 tSNE
To plot these contexts in two dimensions, we use tSNE to reduce the 768-dimensional word embeddings to two. Via Python and numpy, we create a matrix-proper from the further-farther token embeddings extracted above.
import numpy as np
X, y = r.df["emb"].values, r.df["id"].values
X = np.vstack(X)
For good measure, we switch back to R to run tSNE. The matrix X, built in Python, is accessed in the R console below via reticulate::py$X.
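Before running tSNE, a quick shape check on that matrix: rows should equal the number of retained contexts, and columns the 768 embedding dimensions.

dim(reticulate::py$X)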
set.seed(999)
tsne <- Rtsne::Rtsne(X = as.matrix(reticulate::py$X),
                     check_duplicates = FALSE)

tsne_clean <- data.frame(reticulate::py_to_r(df), tsne$Y) %>%
  mutate(t1 = gsub('(further|farther)', '\\<\\1\\>', text, ignore.case = T),
         t2 = stringr::str_wrap(string = t1,
                                width = 20,
                                indent = 1,
                                exdent = 1),
         id = row_number()) %>%
  select(id, form, X1, X2, t1, t2)
The scatter plot below summarizes contextual embeddings for individual tokens of further-farther. Note, for example, a nice space for further used adjectivally on the right side of the plot. Other spaces are less obviously structured, and there are some mixed spaces as well, where speakers seem to have quite a bit of leeway.
p <- ggplot2::ggplot(tsne_clean,
                     aes(x = X1,
                         y = X2,
                         color = form,
                         text = t2,
                         key = id)) +
  geom_hline(yintercept = 0, color = 'gray') +
  geom_vline(xintercept = 0, color = 'gray') +
  geom_point(alpha = 0.5) +
  theme_minimal() +
  ggthemes::scale_colour_economist() +
  ggtitle('further-farther')

plotly::ggplotly(p, tooltip = 'text')
5 Summary
So, some notes on reticulate and Python environments, and on spacy and BERT. While a computational beast, BERT seems fantastically suited to more fine-grained, qualitative semantic analyses and case studies, and to lexicography in general.