Detecting Delicate Text: Going Beyond Toxicity


Protecting users from harmful communications is something we care about deeply at Grammarly. For that reason, our research has recently explored the concept of delicate, or “sensitive,” text. This is writing that discusses emotional, potentially triggering topics.

While there has been a lot of research into detecting and combating toxic text in recent years, we’ve seen a gap when it comes to the broader category of delicate text, which may not be characterized as toxic but still presents a risk. To address this, Grammarly researchers have developed a benchmark dataset of delicate texts, DeTexD, which we’ve used to evaluate the strengths and limitations of different delicate text detection techniques.

In this article, we summarize the results from our paper, “DeTexD: A Benchmark Dataset for Delicate Text Detection,” by Serhii Yavnyi, Oleksii Sliusarenko, Jade Razzaghi, Olena Nahorna, Yichen Mo, Knar Hovakimyan, and Artem Chernodub. This paper was published through the Association for Computational Linguistics’ 7th Workshop on Online Abuse and Harms. A link to the paper and other artifacts, along with an ethics statement, can be found at the end of this article.

What is delicate text?

We define delicate text as any text that is emotionally charged or potentially triggering, such that engaging with it has the potential to result in harm.

Delicate text may not be explicitly offensive and may contain no vulgar language, but it carries a risk for users or agents (e.g., LLMs) that are exposed to it. This risk varies; some delicate texts are highly risky, such as texts about self-harm or content that promotes violence against certain identity groups, while others are less risky, such as text that discusses a controversial political topic in a measured manner.

Delicate texts cover various subjects, with topics ranging from race, gender, and religion to mental health, socioeconomic status, and political affiliations. There is no clear border between delicate text and toxic text, and it’s possible for text to fall into both categories. Examples of delicate texts can be found in our paper.

List of delicate text topics

Building a dataset of delicate texts

To source data, we used the following techniques:

  • Domain specification: We specifically targeted news websites, forums discussing delicate topics, and generally controversial forums.
  • Keyword matching: We developed a dictionary of delicate keywords along with a severity rating for each keyword. We used this dictionary to further refine our dataset and extract delicate texts covering a variety of topics and levels of risk.
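
As a rough illustration of keyword matching with severity ratings, here is a minimal Python sketch. The dictionary entries, the 1 to 5 rating scale, and the function names are all hypothetical; the actual DeTexD keyword dictionary and filtering rules are described in the paper, not here.

```python
import re

# Hypothetical keyword dictionary: keyword -> severity rating (1 = low, 5 = high).
# The entries and the 1-5 scale are invented for illustration only.
SEVERITY = {
    "self-harm": 5,
    "violence": 4,
    "religion": 2,
    "election": 1,
}


def match_keywords(text, severity=SEVERITY):
    """Return the dictionary keywords found in `text` with their ratings."""
    lowered = text.lower()
    found = {}
    for keyword, rating in severity.items():
        # Match whole words only, so "election" does not match "selection".
        pattern = r"(?<!\w)" + re.escape(keyword) + r"(?!\w)"
        if re.search(pattern, lowered):
            found[keyword] = rating
    return found


def max_severity(text, severity=SEVERITY):
    """Highest rating among matched keywords, or 0 if nothing matches."""
    return max(match_keywords(text, severity).values(), default=0)
```

A filter built on this could, for example, keep candidate texts whose maximum severity falls in a chosen range, so that the sample covers several levels of risk rather than only the most extreme ones.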

The annotation of datasets containing delicate text presents a unique challenge given its subjective nature. To mitigate this, we provided annotators with detailed examples and instructions, which can be found in our paper. We also used a two-step annotation process: annotators first identified whether or not texts were delicate, and then rated the risk level of delicate texts. All annotators were expert linguists with previous experience in similar tasks; the final label was decided by majority vote.
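
The two-step process with a majority vote can be sketched in Python as follows. This is an illustration under stated assumptions: the tuple layout, the integer risk scale, the tie handling, and the choice to apply the vote at both steps are hypothetical and are not taken from the paper.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most common label, or None if there is no clear winner."""
    counts = Counter(labels).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        # Tie between the top labels. How ties were resolved is not stated
        # in the article, so this sketch simply reports "no decision".
        return None
    return counts[0][0]


def aggregate(annotations):
    """Combine per-annotator (is_delicate, risk) pairs into one label.

    `risk` is an integer chosen by the annotator, or None when the
    annotator judged the text not delicate (hypothetical encoding).
    """
    # Step 1: is the text delicate at all?
    is_delicate = majority_vote([a[0] for a in annotations])
    if not is_delicate:
        return is_delicate, None
    # Step 2: vote on risk among annotators who marked the text delicate.
    risks = [a[1] for a in annotations if a[0] and a[1] is not None]
    return True, majority_vote(risks)
```

For example, `aggregate([(True, 3), (True, 3), (False, None)])` yields a delicate label with risk 3, because two of the three annotators agree at both steps.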

The result was the labeled DeTexD Training dataset of 40,000 samples and the labeled DeTexD Benchmark dataset of 1,023 paragraphs.


With these new datasets in hand, we were eager to experiment with training a model to detect delicate text. We began by fine-tuning a classification model on the DeTexD Training dataset. As our base model, we used the RoBERTa-based classifier (Liu et al., 2019a); more details on the setup are in our paper.

Our theory was that delicate text detection and toxic text detection are two distinct tasks. To test this, we evaluated our model on several canonical toxic text datasets. While there is some overlap (hate speech is both toxic and delicate), our model tends to be more permissive around texts that contain profanities that are not related to a delicate topic. On the other hand, our model tends to classify texts dealing with topics like race, violence, and sexuality as delicate, even though these texts may not be labeled as toxic. Ultimately, “toxic” and “delicate” are very different concepts.

Because of the differences in definitions, we hypothesized that approaches to detect hate speech would underperform on the DeTexD Benchmark dataset, compared to our baseline model. We evaluated several methods on precision, recall, and F1 score (the harmonic mean of precision and recall).
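
For readers who want the three metrics spelled out, here is a small self-contained Python sketch for binary labels. The function name and the convention of returning 0.0 when a denominator is zero are choices made for this example, not something specified in the paper.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for binary labels (truthy = delicate)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)        # true positives
    fp = sum(1 for t, p in pairs if not t and p)    # false positives
    fn = sum(1 for t, p in pairs if t and not p)    # false negatives

    # Return 0.0 when a ratio is undefined (a convention for this sketch).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1
```

Because F1 is a harmonic mean, a method with very high precision but very low recall still gets a low F1, which is the pattern several of the hate speech detectors show on this benchmark.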

Method                      Precision   Recall   F1
HateBERT, AbusEval          86.7%       11.6%    20.5%
HateBERT, AbusEval#         57.0%       70.2%    62.9%
HateBERT, HatEval           95.2%       6.0%     11.2%
HateBERT, HatEval#          41.1%       86.0%    55.6%
HateBERT, OffensEval        75.4%       31.0%    43.9%
HateBERT, OffensEval#       60.1%       72.6%    65.8%
Google’s Perspective API    77.2%       29.2%    42.3%
OpenAI content filter       55.0%       64.0%    58.9%
OpenAI moderation API       91.3%       18.7%    31.1%
Our baseline model          81.4%       78.3%    79.3%

Performance of various methods on the DeTexD Benchmark evaluation dataset. After receiving helpful feedback from reviewers, we calibrated optimal thresholds for F-score for several methods (marked with #).
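
Calibrating a threshold for F-score can be illustrated with the Python sketch below. It assumes each method outputs a numeric score per text and simply tries every observed score as a cutoff; the function names are hypothetical, and the exact calibration procedure and data split used for the # rows are not described in this article.

```python
def f1_at_threshold(scores, labels, threshold):
    """F1 when every text with score >= threshold is predicted delicate."""
    preds = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p and y)
    fp = sum(1 for p, y in zip(preds, labels) if p and not y)
    fn = sum(1 for p, y in zip(preds, labels) if not p and y)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scores, labels):
    """Try each distinct score as a cutoff and return (threshold, best F1)."""
    candidates = sorted(set(scores))
    best = max(candidates, key=lambda t: f1_at_threshold(scores, labels, t))
    return best, f1_at_threshold(scores, labels, best)
```

Lowering the cutoff trades precision for recall, which matches the pattern in the table: the calibrated (#) rows show lower precision but much higher recall than their uncalibrated counterparts. A threshold chosen on the same data it is evaluated on will look optimistic, so a held-out split is usually preferable.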

As methods apply a broader and more permissive definition of toxicity, they tend to perform better at detecting delicate texts. OffensEval from Caselli et al., 2021, for example, uses this definition: “contains any form of non-acceptable language (profanity) or a targeted offense, which can be veiled or direct.” Google’s Perspective API has a similar definition that includes a wide range of categories.

However, none of the toxic language detection methods that we studied performs very well at detecting delicate text. The evaluated methods either miss coverage, especially on medical and mental health topics, show lower precision on examples that contain offensive keywords (but aren’t deemed delicate according to our definition), or both.

We hope that datasets like DeTexD can help lead to safer AI by broadening how and where we look for potentially harmful content.

Conclusion and ethics statement

Hate speech, offensive language, and delicate texts are sensitive and very important matters. In this research, we’ve tried to dive deeper into the challenges and opportunities of delicate text detection.

We’ve made our annotation guidelines, annotated benchmark dataset, and baseline model publicly available. Please note that we do not recommend using these artifacts without proper due diligence for privacy, security, sensitivity, legal, and compliance measures.

Due to the nature of the subject matter, the DeTexD Benchmark dataset includes a variety of uncensored delicate content, such as hate speech, violence, threats, self-harm, mental health, sexuality, and profanity. Our paper includes keywords and partial text examples of the same type. The most extreme occurrences of such examples in the paper are partially obscured with asterisks, but the semantics are retained.

You can find more information about Grammarly’s research team here. If our work sounds like a good fit for you, consider applying to join; we’re hiring!

Link to paper: DeTexD: A Benchmark Dataset for Delicate Text Detection
