Child Safety Necessitates New Approaches to AI Safety

Open Problems

15 research problems across three phases of the AI lifecycle. Filter by challenge type to explore.

Developing Safe Models by Design

Problems 1–5 · Training phase

Partial data cleaning

Data Evaluation

How does data cleaning affect a generative model’s ability to depict a concept? To what degree can partial cleaning guarantee that a model cannot depict CSAM?

Guaranteeing complete CSAM removal is difficult due to imperfect detection technology, scale of data, and moderator wellness concerns. Determining whether imperfect removal can prevent text-to-image models from learning to generate CSAM is an important open question.

Learn more

Existing Work

LLM and diffusion studies show that filtering training data can minimize harmful capabilities, and research in text-to-image generation shows a critical number of training samples are required for concept composition.

Limitations

AIG-CSAM generation requires strong safety guarantees. While existing work is a starting point, the problem would benefit from formal guarantees on a model’s ability to generate harmful content. Because it is illegal for researchers to generate CSAM, it is unclear how to exhaustively test models’ capabilities.

Preventing concept fusion

Data Evaluation Adversarial

Can generative models be selectively blocked from combining high-risk concepts, such as children and NSFW material?

Harmful content can be created by composing two potentially benign concepts. Novel architectures and training paradigms are needed to selectively prevent unwanted fusion.

Learn more

Existing Work

Prior work includes retrieval-based architectures for revocable sensitive data access and classifiers that self-identify concepts to generate each output.

Limitations

Most concept-based explainability work targets classification and text generation, not image generation. Mechanisms relying on isolated sensitive datastores face challenges in reliably classifying sensitive imagery and may sacrifice quality on benign concepts.

Resilience to harmful fine-tuning

Data Evaluation Adversarial

How can generative models be post-trained to prevent fine-tuning on CSAM or simultaneously fine-tuning on multiple CSAM-related concepts?

Even if base models appear safe, users can unlock harmful capabilities through post-training. CSAM perpetrators use GUI-based LoRA fine-tuning software to locally fine-tune open source models on CSAM.

Learn more

Existing Work

Self-destructing classifiers that degrade when fine-tuned on specific tasks have been proposed, with similar approaches explored for diffusion models.

Limitations

Solutions for self-destructing models that obstruct joint fine-tuning on separate concepts are largely unexplored. Strategies to prevent harmful LoRA fine-tuning are lacking. Methods that build resilience to harmful fine-tuning often require training on obstructed tasks, which CSAM data access restrictions make challenging.

Minimizing human exposure

Adversarial Wellness

How does exposure to AIG-CSAM affect red teamers? How can human exposure be minimized?

The emotional and mental toll of CSAM exposure is well documented across content moderation, law enforcement, and data labeling, with additional wellness implications for red teamers.

Learn more

Existing Work

Frameworks for automated red teaming text-to-image models have been proposed, and industry content review tools incorporate wellness features like image blurring.

Limitations

Text-to-image models are particularly vulnerable to out-of-distribution prompts, requiring robust human red teaming. Offenders quickly discover new jailbreaking mechanisms. Evaluating wellness features requires sociotechnical studies with industry and red teaming providers.

Watermark robustness

Data Evaluation Adversarial

How can text-to-image watermarks be made robust to removal, spoofing, and partial edits?

Offenders actively combat safety interventions, including stripping metadata and watermarks from generated images.

Learn more

Existing Work

Robust watermarking techniques have been explored. Methods like Stable Signature incorporate the watermark directly in model weights during pretraining.

Limitations

Many schemes remain vulnerable to simple attacks. Watermarks can be removed by partially modifying an image or editing code in generation scripts. Techniques that are more robust to fine-tuning and erasure tend to be more vulnerable to being stolen and spoofed. Solutions that include a history of partial edits are also lacking.

Deployment Safeguards

Problems 6–10 · Release and distribution phase

Effective prompt/output detection

Data Evaluation Adversarial Wellness

How can harmful prompts/outputs be reliably detected, in an adversarial ecosystem with limited access to in-distribution examples?

Adversaries jailbreak systems by disguising harmful signals; text-to-image models are particularly vulnerable. Open source deployments face additional challenges as content moderation is not feasible and safety filters are vulnerable.

Learn more

Existing Work

Existing solutions include instruction fine-tuning on similar prompts, automated red teaming tools, zero-shot prompt classifiers, and guideline-based model training to reject unsafe prompts.

Limitations

Red teaming prompts and guidelines may not reflect actual CSAM offender behavior. Most organizations lack CSAM access; accurate classifiers require offender data. No public benchmark exists to evaluate solutions. Research to strengthen solutions in the open source setting is broadly lacking.

Automated model assessment

Evaluation Adversarial Wellness

How can models be assessed for AIG-CSAM capabilities and CSAM training data automatically?

Third-party models are not always assessed for CSAM risks pre-deployment. Scalable assessment is needed to enforce platform policies and prevent distribution of models with AIG-CSAM capabilities.

Learn more

Existing Work

Training data extraction techniques have been applied to detect specific media in training data. Mechanistic interpretability aims to examine learned weights, with automated circuit discovery identifying computational subgraphs for specific behaviors.

Limitations

Using mechanistic interpretability to audit CSAM generation capabilities is unexplored. Current training data extraction techniques rely on prompting for AIG-CSAM, which violates US law. Text-to-video assessment is broadly unexplored.

Standardized safety assessments

Evaluation Adversarial Wellness

How can we standardize assessments for AIG-CSAM capabilities?

Standardized safety assessments allow for consistent and transparent model evaluation. Building confidence and trust in evaluations requires assurance that assessment is robust and not unduly influenced by external incentives.

Learn more

Existing Work

External red teaming and benchmarking are standard for assessing generative models. External domain expertise can help discover novel issues, and benchmarking supports scalable reproducibility.

Limitations

Safety benchmarking may correlate with model capabilities rather than actual safety. Fundamental gaps exist in AI safety assessments for non-text modalities. Static benchmarks quickly become outdated as offenders develop new strategies.

Model transparency

Data Adversarial

How can we use model cards to encourage transparency without inadvertently enabling offenders?

Model cards documenting child safety interventions create a natural pause point for developers to assess safeguards, but transparency can be a double-edged sword.

Learn more

Existing Work

Model cards provide fair assessment on bias and other factors. Card format can influence interpretability for non-technical audiences. The process of filling out model cards can elicit further ethical consideration from developers.

Limitations

Disclosing safety interventions is only effective if deployment is contingent on implementing them; currently, models without CSAM safeguards can still be released. Transparency enables offenders to discover vulnerable models.

Hobbyist ecosystem

Data Adversarial Wellness

How can we encourage adoption of safety best practices across the diverse ML hobbyist ecosystem?

AI safety efforts typically target industry providers rather than hobbyists who build on foundation models. This leaves a gap: downstream actors may lack the resources, incentives, or oversight to maintain CSAM safeguards.

Learn more

Existing Work

AI developers broadly recognize ethical dilemmas but often lack resources and training. Education-based prevention of child sexual abuse is well-studied. Research on hobbyist developers highlights intellectual stimulation as a primary motivation.

Limitations

Education efforts to promote developer awareness and CSAM prevention practices remain understudied. Researching online communities carries risks of harassment and doxing, compounded by the high-stakes nature of child sexual abuse.

Safe Model Maintenance

Problems 11–15 · Post-deployment monitoring and response

Identifying abliterated models

Evaluation Adversarial

How can we identify models and services that have been optimized for CSAM and “nudification”?

With thousands of models, apps, and services created and uploaded daily, rapid identification of those optimized for harmful purposes remains a critical gap.

Learn more

Existing Work

Model fingerprinting uses adversarial attacks to compare outputs for IP protection. Model diffing exploits mechanistic differences to identify model-specific concepts. AIG-CSAM model hashlists can be built using cryptographic hashing.

Limitations

Model fingerprinting requires insight into the “original” model. Model diffing is not well explored for text-to-image models or LoRA fine-tuning. Cryptographic hashing detects exact replicas but not minor modifications.

Robust unlearning

Data Adversarial

How can we reliably erase the concept of CSAM from generative models?

CSAM might be found in training data after deployment, and concept fusion opens additional pathways. The gold standard is retraining from scratch, but this is prohibitively expensive and time-consuming.

Learn more

Existing Work

Approximate machine unlearning and concept erasure for text-to-image models have been explored, removing knowledge with only a textual description of the harmful concept.

Limitations

These methods are vulnerable to adversarial prompts and only provide probabilistic guarantees. CSAM requires exact unlearning, providing strict guarantees that CSAM images have no effect on model output.

Protecting user imagery

Evaluation Adversarial

How do we proactively protect users’ imagery from unwanted AI-generated manipulation?

Model-level solutions are necessary but don’t afford end users or platforms hosting user-generated content agency to proactively protect their own content.

Learn more

Existing Work

Image immunization involves injecting imperceptible perturbations so that editing software fails. Some recent efforts specifically focus on protecting children’s imagery. In IP protection, similar solutions disrupt style mimicry.

Limitations

Research indicates these perturbation strategies are not robust to simple attacks such as image upscaling. Video protection solutions are lacking. Evaluating these techniques for children’s imagery has ethical and legal implications.

Securing AI agents

Data Adversarial Wellness

How can we prevent the misuse of AI agents and code generation to facilitate child sexual abuse?

While criminal actors haven’t yet adopted AI agents for child sexual exploitation, the tools are already used in other criminal enterprises. AI agents capable of relationship building could enable extortion schemes; code generation could automate “nudifying” software creation.

Learn more

Existing Work

Safety for agentic systems is an emerging field. Current work focuses on early detection and prevention of misuse, with most solutions relying on training models using examples of prior misuse such as AI-generated malware.

Limitations

Developing “nudification” code or extortion prompts is ethically ambiguous, raising data legality and researcher wellness concerns. Robust evidence of misuse in child safety contexts may be a prerequisite for industry prioritization.

Assessing safeguards

Evaluation Adversarial Wellness

How do third-party auditors and users effectively assess the efficacy of implemented safeguards?

Even where safeguards have been implemented, assessing their effectiveness — individually and within the broader system context — is necessary for building trust and transparency.

Learn more

Existing Work

Most AI safety assessments focus on individual models through red teaming or benchmarks.

Limitations

Mechanisms for assessing complex AI systems are lacking. Sociotechnical assessments accounting for actual user engagement, offender behavior, and cross-platform dynamics remain uncommon. Companies may lack incentive to provide the access necessary for these studies.

Child Safety NecessitatesNew Approaches to AI Safety

Key Challenges

Data Restrictions

Evaluation Restrictions

Adversarial Environment

Wellness Implications

Open Problems

Developing Safe Models by Design

Partial data cleaning

Existing Work

Limitations

Preventing concept fusion

Existing Work

Limitations

Resilience to harmful fine-tuning

Existing Work

Limitations

Minimizing human exposure

Existing Work

Limitations

Watermark robustness

Existing Work

Limitations

Deployment Safeguards

Effective prompt/output detection

Existing Work

Limitations

Automated model assessment

Existing Work

Limitations

Standardized safety assessments

Existing Work

Limitations

Model transparency

Existing Work

Limitations

Hobbyist ecosystem

Existing Work

Limitations

Safe Model Maintenance

Identifying abliterated models

Existing Work

Limitations

Robust unlearning

Existing Work

Limitations

Protecting user imagery

Existing Work

Limitations

Securing AI agents

Existing Work

Limitations

Assessing safeguards

Existing Work

Limitations

Calls to Action

Model Development

Deployment

Maintenance

Model Development

Deployment

Maintenance

Model Development

Deployment

Maintenance

Cite This Work

Child Safety Necessitates
New Approaches to AI Safety