Bots and virtual moderators will prevent cyberbullying
"For several years, we have been developing the field of AI called neuro-symbolic artificial intelligence. We combine machine learning with reasoning-based symbolic processing. This allows us to maximize the precision of detection. The goal is for AI to make as few mistakes as possible. If we truly want to prevent online harm, the system must operate autonomously, making independent decisions about whether to block a message or send a real-time intervention," say the co-founders of Samurai Labs, Gniewosz Leliwa and Patrycja Tempska.
Patrycja Tempska - impact co-founder of Samurai Labs. She conducts research on methods to prevent online violence. Together with the educational and support platform Życie Warte Jest Rozmowy (Life is Worth a Conversation), she co-creates the One Life project, which reaches out to people in crisis using neuro-symbolic algorithms that analyze hundreds of millions of online conversations. Co-author of patents and scientific publications in the field of artificial intelligence and social sciences, with a background in philosophy. Included in the Top 100 Women in AI in 2022.
Gniewosz Leliwa - CTO and co-founder of Samurai Labs. Co-creator of AI solutions protecting millions of internet users worldwide from cyberbullying. A theoretical physicist with a background in quantum field theory, who abandoned his PhD to work on artificial intelligence. Author of numerous patents and scientific publications in the field of neuro-symbolic AI and its applications in detecting and preventing phenomena such as cyberbullying, suicidal ideation, and child grooming. Co-founder of Fido.AI and co-creator of natural language understanding technology recognized in CB Insights' TOP 100 AI ranking and as a Gartner Cool Vendor.
What does Samurai Labs specialize in?
Patrycja Tempska: Our mission is to promote the well-being of online communities by detecting and preventing various harmful phenomena on the Internet. This includes cyberbullying, racism and gender-based personal attacks, sexism, blackmail, or threats. Recently, we've also been focusing on detecting suicidal intentions, thoughts, or declarations. We reach out to people in suicidal crisis by offering them a caring intervention that includes relevant self-help materials and places where they can seek help. We co-create the system in close collaboration with a team of experts - suicidologists from the educational and support platform Życie Warte Jest Rozmowy (Life is Worth a Conversation). These are people who specialize in the study of suicidal behavior, provide support to people in crisis, and work daily in the area of suicide prevention.
In the context of cyberbullying, when such issues arise within a community, depending on the community's rules and the phenomenon’s severity, we can take various actions. In some cases, written communications are blocked before they reach the user to prevent damage before it's done. In other cases, interventions might be sent to positively model online discussions. These messages are designed based on social sciences, psychology, and philosophy. All these efforts aim to educate users and promote desired communication norms. As our research shows, these actions result in a reduced number of attacks within specific online communities.
Based on what data do you detect such phenomena?
Gniewosz Leliwa: Essentially, any textual communication is relevant. If we're talking about platforms like Discord, chats, or online games, we analyze short text messages that users exchange with one another. For forums and sites like Reddit, longer forms of expression as well as comments on these platforms are subject to analysis. On Twitch, the system processes chat messages during streaming. We can also transcribe audio and video files, but our primary focus is on analyzing and processing natural language in texts.
Let's talk about technologies. What solutions do you use to detect online violence?
GL: For several years, we have been developing the field of AI called neuro-symbolic artificial intelligence. We combine machine learning with reasoning-based symbolic processing. This allows us to maximize the precision of detection. The goal is for AI to make as few mistakes as possible. If we truly want to prevent online harm, the system must operate autonomously, making independent decisions about whether to block a message or send a real-time intervention.
In our approach, the symbolic system controls the machine learning components. As a result, machine learning "understands" language better, and symbolic reasoning prevents the statistical components from making common errors. For example, a model that is overly sensitive to profanity might start flagging it as hate speech or personal attacks.
GL: If it isn't, it would react to things it shouldn't. If someone uses a profane word not to offend anyone but to emphasize emotions, then interventions, warnings, or blocks could be met with disapproval within the communities we work with. It's like a bouncer in a club throwing out people who are just having a good time.
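The idea that a symbolic layer keeps a statistical score in check can be illustrated with a minimal sketch. It is not Samurai Labs' proprietary framework: the word lists, the `ml_toxicity_score` stand-in, the `is_directed_at_person` rule, and the threshold are all hypothetical and chosen only to show why profanity alone should not trigger a reaction.

```python
import re

# Hypothetical word lists, for illustration only.
PROFANITY = {"damn", "hell"}
INSULTS = {"idiot", "loser", "moron"}
SECOND_PERSON = {"you", "your", "you're", "u"}


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def ml_toxicity_score(tokens: list[str]) -> float:
    """Stand-in for a statistical model that reacts to any rude word."""
    rude = PROFANITY | INSULTS
    hits = sum(1 for t in tokens if t in rude)
    return min(1.0, hits / 2)


def is_directed_at_person(tokens: list[str]) -> bool:
    """Symbolic rule: an insult appears together with a second-person reference."""
    has_target = any(t in SECOND_PERSON for t in tokens)
    has_insult = any(t in INSULTS for t in tokens)
    return has_target and has_insult


def is_personal_attack(text: str, threshold: float = 0.5) -> bool:
    tokens = tokenize(text)
    # The rule gates the statistical score: a high score alone is not enough.
    return is_directed_at_person(tokens) and ml_toxicity_score(tokens) >= threshold
```

With these toy rules, "damn, that match was hell" gets the maximum statistical score but is not treated as an attack, while "you are an idiot" is.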
How many people work in your team?
GL: Over 20 people are involved in developing and implementing our models. This work is carried out by three engineering teams. The first is the product team, which "wraps" all the models we create into APIs, builds configuration systems, analytical panels, and moderation queues. The second and third are AI teams, one more focused on the symbolic and reasoning aspect, and the other on machine learning.
What is the process of working with data at Samurai Labs like?
GL: As I mentioned, the input data primarily consists of real conversations from various types of chats, forums, and other online communities. We utilize open sources like Reddit, but when possible, we also use data from partners or clients. All collected data is sent to annotation, where specifically trained annotators mark fragments containing the phenomena searched for by the model, using a dedicated tool and following pre-prepared instructions. These phenomena include, for example, personal attacks or suicidal thoughts.
We've built our own team of over 20 annotators, whom we try to recruit from people with experience in psychology and pedagogy. We also pay attention to geographic diversity (part of the team comes from South America) and familiarity with the specific topic. For instance, we try to have gamers annotate content from online games. We developed the entire annotation framework ourselves. Initially, we tried to use available datasets, but it quickly turned out that, unfortunately, they were not of the quality we needed.
We also use artificial intelligence in the annotation process itself. We've created a so-called virtual annotator, a special AI model whose decisions are compared with those of human annotators. This allows us to detect even slight differences and re-annotate such examples.
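The comparison between the virtual annotator and human annotators could look roughly like the sketch below. The function name, the dictionary layout (message ID mapped to label), and the label strings are assumptions made for illustration.

```python
def find_disagreements(
    human_labels: dict[str, str],
    model_labels: dict[str, str],
) -> list[str]:
    """Return IDs of messages where the virtual annotator's label
    differs from the human label, as candidates for re-annotation."""
    return sorted(
        msg_id
        for msg_id, human in human_labels.items()
        if msg_id in model_labels and model_labels[msg_id] != human
    )


# Hypothetical example data.
human = {"m1": "attack", "m2": "neutral", "m3": "neutral"}
model = {"m1": "attack", "m2": "attack", "m3": "neutral"}
to_review = find_disagreements(human, model)
```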
What does training such AI models look like?
GL: When we start a new project, we create annotation manuals in collaboration with experts from the specific field, such as suicidology, and the AI team. Instructions are then updated multiple times to capture and include all nuances. Then, the data annotation process begins, along with the training of the initial models that assist in selecting cases for the subsequent rounds of annotation.
Each annotation follows at least a 3+1 model. This means that three independent annotators evaluate each message, and then a superannotator makes the final decision on disputed cases. When a problem arises that the instructions should cover, they are updated. We place great importance on data quality because, as we know, a machine learning model is only as good as the data it was trained on.
The annotated data goes to both AI teams, and the work on the final models begins.
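The 3+1 scheme described above can be sketched as a small resolution function. This sketch assumes that "disputed" means any case where the three annotators are not unanimous; the interview does not specify the exact rule, and the function and label names are hypothetical.

```python
from typing import Callable, Sequence


def resolve_label(
    votes: Sequence[str],
    superannotator: Callable[[Sequence[str]], str],
) -> str:
    """Unanimous votes stand; any dispute goes to the superannotator."""
    if len(votes) < 3:
        raise ValueError("expected at least three independent annotations")
    if len(set(votes)) == 1:
        return votes[0]
    # Assumption: any non-unanimous case counts as disputed.
    return superannotator(votes)
```

Here `superannotator` is any callable that inspects the conflicting votes and returns the final label, for example a function that opens the case in an annotation tool.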
What tools do you use?
GL: When it comes to the symbolic system and its integration with machine learning, this is our proprietary approach and solution. We've built our own framework and hold patents in this area.
Regarding machine learning itself, we use transformers and large language models (LLMs). We primarily use libraries like Transformers (Hugging Face), Torch, and Sklearn. For neural network model quantization, we use ONNX. We log experiments using MLFlow and automate processes with DVC. Our environment for running experiments is SageMaker, and for prototyping we use Jupyter Notebook.
What does using models in practice look like? What challenges do you face during data analysis and subsequent detection?
GL: An interesting aspect is using large language models to filter out false positives. We can do that once we have a functioning detection model and want to consider a broader context of the conversation. Imagine a forum post discussing a crime, and users' comments are not favorably directed at the criminal. Normally, the system might react to those comments, "thinking" they are targeting another forum user. However, thanks to the broader context, the system can decide not to react.
What about the effectiveness of the models?
GL: All of our production models have a precision level of at least 95 percent. This is the main parameter we are interested in, because these models operate autonomously, without human intervention. In the case of competitive solutions, even half of the results returned are false positives.
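For readers unfamiliar with the metric, precision is the share of flagged messages that really contain the phenomenon: true positives divided by all positives returned. A short sketch with made-up counts shows the difference between the two figures quoted above.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Fraction of flagged items that were flagged correctly."""
    flagged = true_positives + false_positives
    if flagged == 0:
        raise ValueError("precision is undefined when nothing was flagged")
    return true_positives / flagged


# Illustrative counts only: out of 100 flagged messages...
print(precision(95, 5))   # 0.95 -> 5 users wrongly warned or blocked
print(precision(50, 50))  # 0.5  -> every second reaction is a mistake
```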
Every community is different. How do you generate a tailor-made detection model?
GL: Moderating channels for adults, where users don't want any censorship and only aim to maintain a certain level of discussion, should definitely look different from moderating channels for children. In the latter case, we want to eliminate all potentially harmful content, including profanity or discussions about sensitive topics. At Samurai Labs, we adopt a compositional approach. We break down every large problem into smaller ones, like cyberbullying, which we divide into personal attacks, sexual harassment, rejection, threats, or blackmail. Each of these smaller issues is then broken down even further. In this way, we build narrow and highly precise models that are easy to develop and maintain. They also cope much better with the biases typical of ML models, especially end-to-end ones that attempt to solve large and complex problems like hate speech or cyberbullying.
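The compositional approach can be illustrated with a toy example in which several narrow detectors are combined differently for each community. The detectors below are keyword checks standing in for real models, and the category names and community profiles are hypothetical.

```python
from typing import Callable

Detector = Callable[[str], bool]


# Hypothetical narrow detectors; real ones would be models, not keyword checks.
def detects_threat(text: str) -> bool:
    return "i will hurt you" in text.lower()


def detects_blackmail(text: str) -> bool:
    return "or i will post" in text.lower()


def detects_profanity(text: str) -> bool:
    return "damn" in text.lower()


DETECTORS: dict[str, Detector] = {
    "threat": detects_threat,
    "blackmail": detects_blackmail,
    "profanity": detects_profanity,
}

# Each community enables only the narrow categories it cares about.
COMMUNITY_PROFILES: dict[str, set[str]] = {
    "adults": {"threat", "blackmail"},
    "children": {"threat", "blackmail", "profanity"},
}


def detect(text: str, community: str) -> list[str]:
    """Run only the detectors enabled for the given community."""
    enabled = COMMUNITY_PROFILES[community]
    return sorted(name for name in enabled if DETECTORS[name](text))
```

In this sketch, "Damn, nice shot" passes untouched in the adult profile but is reported as profanity in the children's profile, without changing any detector.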
How does the Samurai Cyber Guardian work?
GL: It's a system designed to create and implement an entire moderation workflow tailored to a specific online community. The product consists of several components. We have AI models responsible for detecting specific phenomena and a "control center" that users (e.g., moderators) log into. The control center includes various tools and panels. The configuration panel allows users to decide how the system should react automatically and in what manner, and what should be subject to manual moderation. The moderation queue handles cases for manual moderation. Analytical panels allow users to track the system's performance and observe changes in user behavior and the overall level of violence within a given community. The product is delivered as an API and can be used to control a moderation bot or any other moderation system. We also offer direct integrations with platforms and services like Discord or Twitch.
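The split between automatic reactions and manual moderation could be modeled as in the sketch below. It does not reflect the actual Samurai Cyber Guardian API; the class, the action names ("block", "intervene", "queued"), and the category names are assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ModerationRouter:
    # Hypothetical config: detected category -> automatic action.
    automatic_actions: dict[str, str]
    # Cases with no automatic action configured wait here for a human.
    manual_queue: list[tuple[str, str]] = field(default_factory=list)

    def handle(self, message: str, category: str) -> str:
        action = self.automatic_actions.get(category)
        if action is None:
            self.manual_queue.append((message, category))
            return "queued"
        return action


router = ModerationRouter(
    automatic_actions={"threat": "block", "personal_attack": "intervene"}
)
```

A community that wants humans to review more cases would simply list fewer categories in `automatic_actions`.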
Content moderation on a forum is like working with a living organism that evolves in real-time. Are your systems updated?
GL: We operate under the assumption that it's not possible to build a model that will work always and everywhere, which is similar to antivirus systems. Our models are updated on average once every two weeks - we collect logs, analyze data, and based on that, make adjustments to the models.
A classic method of evading detection is using "leet speak," which involves replacing letters with similar-looking symbols, such as replacing "S" with a dollar sign. Our system is also highly resilient to this technique, partly due to the neuro-symbolic approach. Furthermore, if users know that a moderation system is AI-driven, they're more likely to try to cheat it. But the more creative the user, the better our system learns to handle such attempts to bypass the system.
PT: One example involves comments exchanged by teenagers on the anonymous Formspring forum. Today, the site is closed due to widespread cyberbullying that led to several suicide attempts by young individuals. Some comments marked by people as neutral, when processed by our system, turned out to be veiled attacks using leet speak.
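The simplest defense against the leet speak substitutions mentioned above is to normalize text before detection. The sketch below uses a small, hypothetical substitution table; it is a naive illustration, not the neuro-symbolic handling described in the interview, and it would also distort genuine numbers such as years or scores.

```python
# Hypothetical substitution table; a real one would be far larger
# and would need context to avoid rewriting genuine digits.
LEET_TABLE = str.maketrans({
    "$": "s",
    "@": "a",
    "0": "o",
    "1": "i",
    "3": "e",
    "4": "a",
    "5": "s",
    "7": "t",
})


def normalize_leet(text: str) -> str:
    """Lowercase the text and map look-alike symbols back to letters."""
    return text.lower().translate(LEET_TABLE)


print(normalize_leet("L0$3R"))  # loser
```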
Algorithms in the fight against cyberbullying are one thing, but are you also trying to educate communities?
PT: In addition to detecting cyberbullying using neuro-symbolic algorithms, research on the utilization of these methods is crucial. We conduct research to create comprehensive strategies for online communities, where artificial intelligence is used not only to detect cyberbullying but also to proactively prevent it. We explore different strategies for responding to users' comments with the aim of reducing the number of personal attacks. One such study we conducted took place on an English-language Reddit forum. We created a bot named James, equipped with personal attack detection models and a system to generate various interventions that appealed to empathy or specific norms. Whenever someone attacked users involved in a discussion, James detected the attack in real-time and responded with one of the messages, such as "Hey, most of us address each other here with respect." Such comments alone were enough for James, in one of the more radicalized Reddit communities, to reduce the level of attacks by 20%.
GL: It's worth mentioning that the user didn't know they were interacting with a bot. Our James presented himself as a regular forum user and had his own activity history and background. His interventions had to look natural and not repetitive.
PT: Exactly. That's why the number of unique interventions reached over 100,000, all created based on a dozen or so basic statements. This study, along with many others conducted by us and other institutions, shows that at the intersection of artificial intelligence, social sciences, and data science, we can empirically validate the effectiveness of specific methods to counter harmful phenomena and maximize their positive social impact.
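One way a small set of basic statements can yield a very large number of unique messages is by combining interchangeable fragments. The sketch below only illustrates that combinatorial effect; the fragments are invented, and the interview does not describe how James's interventions were actually generated.

```python
from itertools import product

# Invented fragments, for illustration only.
GREETINGS = ["Hey,", "Hi,", "Look,"]
SUBJECTS = ["most of us", "people here", "folks in this thread"]
NORMS = [
    "address each other with respect.",
    "try to keep things civil.",
    "disagree without insults.",
]


def build_interventions() -> list[str]:
    """Combine one fragment from each slot into a full message."""
    return [" ".join(parts) for parts in product(GREETINGS, SUBJECTS, NORMS)]


interventions = build_interventions()
print(len(interventions))  # 27, i.e. 3 * 3 * 3
```

Because the counts multiply, adding more slots or more variants per slot grows the number of distinct messages very quickly.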
Does this have a financial dimension for your clients?
PT: Impact indicators (related to positive social impact) are important, but so are the business indicators. After all, we implement our solutions in communities whose owners want to generate income. It turns out that impact indicators are linked to business ones through engagement. About 10 years ago, there was a belief that more aggression implied more user engagement - more comments, clicks, etc. Today, we have evidence to the contrary. According to a study by Riot Games, League of Legends players who experienced toxic interactions upon their first exposure to the game were over three times less likely to return to the game compared to those who didn't encounter such content.
At Samurai Labs, in one of our observational studies based on around 200,000 comments on Reddit, we showed that attacks significantly reduce the activity of the attacked individuals. We used traditional statistical methods and Bayesian estimation.
As awareness of the social problem grows, so does the motivation to take care of the well-being of online communities. This is linked to the range of negative behavioral and psychological consequences of experiencing cyberbullying, which are increasingly being researched and described.
GL: Additionally, the legal landscape is changing. Take the suicidal ideation detection module, for instance. When we started working on it in 2020, it was still a taboo topic. The average parent could believe that their child might encounter harmful online behavior, such as a pedophile attack, but most parents couldn't even conceive that their child could commit suicide, partly due to exposure to online content related to self-harm or suicide. Thanks to legal regulations, this awareness is growing.
PT: Here, one of the catalysts for changes in social media policies and the development of new legislation is an example from the UK. It involves the widely publicized case of Molly Russell, a 14-year-old who took her own life after being exposed to content related to suicidal behavior, the visibility of which was amplified by social media algorithms.
What does the future hold for the systems you create? Are you moving towards full autonomy in decision-making?
GL: Samurai Labs is a pioneer when it comes to prevention and autonomous content moderation. I think it's a natural direction, and the entire industry will strongly lean towards it. If the response comes long after someone has been attacked, they're already a victim, and others may have read the harmful content, resulting in harm done. Harmful phenomena should be detected as quickly as possible and responded to immediately.
There's also the issue of data access in terms of legislation. When it comes to detecting suicidal content, messages or farewell letters are often sent through public forums. In the case of pedophilic attacks, private communication is often involved, and the offender aims to quickly transition to encrypted channels.
I think a middle ground will be autonomous systems where artificial intelligence analyzes the content being sent, and there won't be a need for anyone to read those messages. If AI detects something troubling, it will react by blocking the communication of that predator and inform the site owners or law enforcement.
PT: Shifting the paradigm to operate automatically, without human involvement or with partial human intervention, will allow us to prevent numerous negative consequences of cyberbullying. It's important to remember that today, there's a heavy burden mainly put on moderators. Machine learning-based systems don't operate automatically for the most part; they flag posts for further verification by a human who makes the final decision whether to remove the post or not.
In making these decisions, moderators encounter extremely drastic content every day, along with all the cruelty that we as humanity generate. It all rests on them. That's why we see a great need to relieve moderators, allowing them to focus on positively increasing user engagement within online platforms.