ActiveFence Integration

NeMo Guardrails supports the ActiveFence ActiveScore API as an input and output rail out of the box. To use it, set the ACTIVEFENCE_API_KEY environment variable. The following configuration enables the input rail:

rails:
  input:
    flows:
      # The simplified version
      - activefence moderation on input

      # The detailed version with individual risk scores
      # - activefence moderation on input detailed
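
With a configuration like the one above saved to disk, a minimal Python sketch for failing fast when the API key is missing might look like the following. The `./config` path and the helper names `require_api_key` and `load_rails` are assumptions made for illustration; `RailsConfig.from_path` and `LLMRails` are the usual NeMo Guardrails entry points.

```python
import os
from typing import Mapping


def require_api_key(env: Mapping[str, str] = os.environ) -> str:
    """Return the ActiveFence API key, or raise if it is not set."""
    key = env.get("ACTIVEFENCE_API_KEY", "")
    if not key:
        raise RuntimeError("ACTIVEFENCE_API_KEY must be set to use the ActiveFence rails")
    return key


def load_rails(config_path: str = "./config"):
    """Load a guardrails config from a (hypothetical) directory.

    The import is kept local so that the key check above can be used
    even where nemoguardrails is not installed.
    """
    from nemoguardrails import LLMRails, RailsConfig

    require_api_key()
    config = RailsConfig.from_path(config_path)
    return LLMRails(config)
```

Checking the variable before constructing the rails surfaces a missing key at startup rather than on the first moderated message.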

The activefence moderation on input flow compares the maximum risk score against a 0.85 threshold to decide whether the text is allowed: if the maximum risk score is above the threshold, the text is considered a violation. The activefence moderation on input detailed flow instead uses an individual score for each category of violation.
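
The decision rule described above can be sketched in a few lines of Python. This is an illustration of the logic only, not the library's implementation; it assumes the API response has been reduced to a mapping from category name to a score between 0 and 1, and the function names are hypothetical.

```python
from typing import Mapping

DEFAULT_THRESHOLD = 0.85  # the threshold quoted in the text above


def max_risk_score(violations: Mapping[str, float]) -> float:
    """Highest score across all categories; 0.0 when none are reported."""
    return max(violations.values(), default=0.0)


def is_violation(violations: Mapping[str, float],
                 threshold: float = DEFAULT_THRESHOLD) -> bool:
    """A text is a violation only if its maximum score is strictly above the threshold."""
    return max_risk_score(violations) > threshold
```

Note that the comparison is strict, so a score exactly equal to the threshold is still allowed.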

To customize the thresholds, override the default flows in your config. For example, to change the threshold for activefence moderation on input, add the following flow to your config and replace 0.85 with the value you want:

define subflow activefence moderation on input
  """Guardrail based on the maximum risk score."""
  $result = execute call activefence api

  if $result.max_risk_score > 0.85
    bot inform cannot answer
    stop

ActiveFence’s ActiveScore API lets you control the handling of each supported violation individually. To take advantage of that, use the violations dictionary (violations_dict), one of the outputs from the API, to set a different threshold for each violation. The flow below, which reads the dictionary as $result.violations, is an example of such an input moderation flow:

define flow activefence input moderation detailed
  $result = execute call activefence api

  if $result.violations.get("abusive_or_harmful.hate_speech", 0) > 0.8
    bot inform cannot engage in abusive or harmful behavior
    stop

define bot inform cannot engage in abusive or harmful behavior
  "I will not engage in any abusive or harmful behavior."
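
The per-category check in the flow above generalizes to a table of thresholds. The following Python sketch shows that idea under stated assumptions: the violations dictionary maps category names to scores, `exceeded_categories` is a hypothetical helper, and `example.other_category` is a placeholder rather than a real ActiveFence category.

```python
from typing import List, Mapping


def exceeded_categories(violations: Mapping[str, float],
                        thresholds: Mapping[str, float]) -> List[str]:
    """Return the categories whose score is strictly above their own threshold.

    Categories missing from the violations dictionary are treated as 0,
    mirroring the .get(..., 0) call in the flow.
    """
    return sorted(
        name
        for name, limit in thresholds.items()
        if violations.get(name, 0.0) > limit
    )


# Illustrative thresholds; only the first category name comes from the flow above.
THRESHOLDS = {
    "abusive_or_harmful.hate_speech": 0.8,
    "example.other_category": 0.5,
}
```

An empty result means the text passes every configured check; a non-empty one tells you which refusal message to send.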