In [1]:
# Delete this cell to re-enable tracebacks
import sys
ipython = get_ipython()
def hide_traceback(exc_tuple=None, filename=None, tb_offset=None,
                   exception_only=False, running_compiled_code=False):
    etype, value, tb = sys.exc_info()
    value.__cause__ = None  # suppress chained exceptions
    return ipython._showtraceback(etype, value, ipython.InteractiveTB.get_exception_only(etype, value))

ipython.showtraceback = hide_traceback
In [2]:
# JSON output syntax highlighting
from __future__ import print_function
from pygments import highlight
from pygments.lexers import JsonLexer, TextLexer
from pygments.formatters import HtmlFormatter
from IPython.display import display, HTML
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

def json_print(inpt):
    string = str(inpt)
    formatter = HtmlFormatter()
    # Use the JSON lexer only when the output looks like a JSON object
    if string[0] == '{':
        lexer = JsonLexer()
    else:
        lexer = TextLexer()
    return HTML('<style type="text/css">{}</style>{}'.format(
        formatter.get_style_defs('.highlight'),
        highlight(string, lexer, formatter)))

globals()['print'] = json_print
The Environment has a function for checking if two STIX Objects are semantically equivalent. For each supported object type, the algorithm checks if the values for a specific set of properties match. Each matching property is then weighted, since not every property is equally important for semantic equivalence. The result is the sum of these weighted values, in the range of 0 to 100. A result of 0 means that the two objects are not equivalent, and a result of 100 means that they are equivalent.
TODO: Add a link to the committee note when it is released.
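As a rough illustration of the weighted sum described above, the sketch below combines per-property scores (each between 0 and 1) into a result between 0 and 100. The function name weighted_equivalence and the sample scores are hypothetical, and dividing by the sum of the weights is an assumption about how the final value is scaled rather than a copy of the library's code.

```python
def weighted_equivalence(scores, weights):
    """Combine per-property scores (0.0 to 1.0) into a 0 to 100 result."""
    matching_score = 0.0
    sum_weights = 0.0
    for prop, score in scores.items():
        w = weights[prop]
        sum_weights += w
        matching_score += w * score
    if sum_weights == 0:
        return 0.0
    # Assumption: the result is normalized by the total weight considered.
    return matching_score / sum_weights * 100.0


# Hypothetical values: the names are half similar, the references match fully.
example_weights = {"name": 30, "external_references": 70}
example_scores = {"name": 0.5, "external_references": 1.0}
example_result = weighted_equivalence(example_scores, example_weights)
```

With these made-up inputs, the name contributes 15 of its 30 points and the external references contribute all 70.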
There are a number of use cases for which calculating semantic equivalence may be helpful. It can be used for echo detection, in which a STIX producer who consumes content from other producers wants to make sure they are not creating content they have already seen or consuming content they have already created.
Another use case for this functionality is to identify identical or near-identical content, such as a vulnerability shared under three different nicknames by three different STIX producers. A third use case involves a feed that aggregates data from multiple other sources. It will want to make sure that it is not publishing duplicate data.
Below we will show examples of the semantic equivalence results of various objects. Unless otherwise specified, the ID of each object will be generated by the library, so the two objects will not have the same ID. This demonstrates that the semantic equivalence algorithm only looks at specific properties for each object type.
Please note that you will need to install a few extra dependencies in order to use the semantic equivalence functions. You can do this using:
pip install stix2[semantic]
For Attack Patterns, the only properties that contribute to semantic equivalence are name and external_references, with weights of 30 and 70, respectively. In this example, both attack patterns have the same external reference but the second has a slightly different yet still similar name.
In [3]:
import stix2
from stix2 import AttackPattern, Environment, MemoryStore
env = Environment(store=MemoryStore())
ap1 = AttackPattern(
    name="Phishing",
    external_references=[
        {
            "url": "https://example2",
            "source_name": "some-source2",
        },
    ],
)
ap2 = AttackPattern(
    name="Spear phishing",
    external_references=[
        {
            "url": "https://example2",
            "source_name": "some-source2",
        },
    ],
)
print(env.semantically_equivalent(ap1, ap2))
Out[3]:
For Campaigns, the only properties that contribute to semantic equivalence are name and aliases, with weights of 60 and 40, respectively. In this example, the two campaigns have completely different names, but slightly similar descriptions. The result may be higher than expected because the Jaro-Winkler algorithm used to compare string properties looks at the edit distance of the two strings rather than just the words in them.
In [4]:
from stix2 import Campaign
c1 = Campaign(
    name="Someone Attacks Somebody",
)
c2 = Campaign(
    name="Another Campaign",
)
print(env.semantically_equivalent(c1, c2))
Out[4]:
For Identities, the only properties that contribute to semantic equivalence are name, identity_class, and sectors, with weights of 60, 20, and 20, respectively. In this example, the two identities are identical, but are missing one of the contributing properties. The algorithm only compares properties that are actually present on the objects. Also note that they have completely different description properties, but because description is not one of the properties considered for semantic equivalence, this difference has no effect on the result.
In [5]:
from stix2 import Identity
id1 = Identity(
    name="John Smith",
    identity_class="individual",
    description="Just some guy",
)
id2 = Identity(
    name="John Smith",
    identity_class="individual",
    description="A person",
)
print(env.semantically_equivalent(id1, id2))
Out[5]:
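The rule that only properties present on both objects are compared can be sketched in plain Python. This is a simplified stand-in, not the library's implementation: the helper name score_present_properties is hypothetical, plain dictionaries replace STIX objects, and an exact-match test replaces the real per-property comparison functions.

```python
def score_present_properties(obj1, obj2, weights):
    """Score only the weighted properties that both objects define."""
    matching_score = 0.0
    sum_weights = 0.0
    for prop, w in weights.items():
        if prop in obj1 and prop in obj2:
            sum_weights += w
            # Exact match stands in for the real comparison function.
            matching_score += w * (1.0 if obj1[prop] == obj2[prop] else 0.0)
    return matching_score, sum_weights


identity_weights = {"name": 60, "identity_class": 20, "sectors": 20}
person_a = {"name": "John Smith", "identity_class": "individual",
            "description": "Just some guy"}
person_b = {"name": "John Smith", "identity_class": "individual",
            "description": "A person"}
identity_scores = score_present_properties(person_a, person_b, identity_weights)
```

Because sectors is absent from both dictionaries, its weight of 20 never enters sum_weights, and description is ignored because it has no weight at all.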
For Indicators, the only properties that contribute to semantic equivalence are indicator_types, pattern, and valid_from, with weights of 15, 80, and 5, respectively. In this example, the two indicators have patterns with different hashes but the same indicator_types and valid_from. For patterns, the algorithm currently only checks if they are identical.
In [6]:
from stix2.v21 import Indicator
ind1 = Indicator(
    indicator_types=['malicious-activity'],
    pattern_type="stix",
    pattern="[file:hashes.MD5 = 'd41d8cd98f00b204e9800998ecf8427e']",
    valid_from="2017-01-01T12:34:56Z",
)
ind2 = Indicator(
    indicator_types=['malicious-activity'],
    pattern_type="stix",
    pattern="[file:hashes.MD5 = '79054025255fb1a26e4bc422aef54eb4']",
    valid_from="2017-01-01T12:34:56Z",
)
print(env.semantically_equivalent(ind1, ind2))
Out[6]:
If the patterns were identical the result would have been 100.
For Locations, the only properties that contribute to semantic equivalence are longitude/latitude, region, and country, with weights of 34, 33, and 33, respectively. In this example, the two locations are Washington, D.C. and New York City. The algorithm computes the distance between two locations using the haversine formula and uses that to influence equivalence.
In [7]:
from stix2 import Location
loc1 = Location(
    latitude=38.889,
    longitude=-77.023,
)
loc2 = Location(
    latitude=40.713,
    longitude=-74.006,
)
print(env.semantically_equivalent(loc1, loc2))
Out[7]:
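To make the haversine step concrete, here is a minimal standalone sketch of the great-circle distance between the two coordinates used above. The function names, the Earth radius constant, and the 1000 km threshold used to turn a distance into a score between 0 and 1 are all assumptions chosen for illustration. They are not taken from the library.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/long points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def distance_score(distance_km, threshold_km=1000.0):
    """Map a distance to a 0..1 score; threshold_km is a hypothetical cutoff."""
    return max(0.0, 1.0 - distance_km / threshold_km)


dc_to_nyc_km = haversine_km(38.889, -77.023, 40.713, -74.006)
dc_to_nyc_score = distance_score(dc_to_nyc_km)
```

Under these assumptions, two points at the same coordinates score 1, and any pair farther apart than the threshold scores 0.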
For Malware, the only properties that contribute to semantic equivalence are malware_types and name, with weights of 20 and 80, respectively. In this example, the two malware objects only differ in the strings in their malware_types lists. For lists, the algorithm bases its calculations on the intersection of the two lists. An empty intersection will result in a 0, and a complete intersection will result in a 1 for that property.
In [8]:
from stix2 import Malware
MALWARE_ID = "malware--9c4638ec-f1de-4ddb-abf4-1b760417654e"
mal1 = Malware(
    id=MALWARE_ID,
    malware_types=['ransomware'],
    name="Cryptolocker",
    is_family=False,
)
mal2 = Malware(
    id=MALWARE_ID,
    malware_types=['ransomware', 'dropper'],
    name="Cryptolocker",
    is_family=False,
)
print(env.semantically_equivalent(mal1, mal2))
Out[8]:
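The intersection-based list comparison described above might look like the following sketch. The helper name list_overlap is hypothetical, and dividing the size of the intersection by the length of the longer list is an assumption about how partial overlap is scaled.

```python
def list_overlap(l1, l2):
    """Return a 0..1 score based on how many entries the two lists share."""
    if not l1 or not l2:
        return 0.0
    common = set(l1) & set(l2)
    # Assumption: normalize by the longer list so extra entries lower the score.
    return len(common) / max(len(l1), len(l2))


malware_types_score = list_overlap(['ransomware'], ['ransomware', 'dropper'])
```

Under this assumption the malware_types lists from the example share one of two entries, so that property would contribute half of its weight.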
For Threat Actors, the only properties that contribute to semantic equivalence are threat_actor_types, name, and aliases, with weights of 20, 60, and 20, respectively. In this example, the two threat actors have the same id properties but everything else is different. Since the id property does not factor into semantic equivalence, the result is not very high. The result is not zero because of the "Token Sort Ratio" algorithm used to compare the name property.
In [9]:
from stix2 import ThreatActor
THREAT_ACTOR_ID = "threat-actor--8e2e2d2b-17d4-4cbf-938f-98ee46b3cd3f"
ta1 = ThreatActor(
    id=THREAT_ACTOR_ID,
    threat_actor_types=["crime-syndicate"],
    name="Evil Org",
    aliases=["super-evil"],
)
ta2 = ThreatActor(
    id=THREAT_ACTOR_ID,
    threat_actor_types=["spy"],
    name="James Bond",
    aliases=["007"],
)
print(env.semantically_equivalent(ta1, ta2))
Out[9]:
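To see why unrelated names can still score above zero, the sketch below approximates a token sort comparison using only the standard library. It swaps in difflib.SequenceMatcher for whatever string-matching backend the library actually uses, so the numbers it produces are illustrative and will not necessarily match the library's. The function name token_sort_similarity is hypothetical.

```python
from difflib import SequenceMatcher


def token_sort_similarity(s1, s2):
    """Compare two strings after lowercasing and sorting their words."""
    def normalize(s):
        return " ".join(sorted(s.lower().split()))

    return SequenceMatcher(None, normalize(s1), normalize(s2)).ratio()


same_words = token_sort_similarity("James Bond", "Bond James")
different_names = token_sort_similarity("Evil Org", "James Bond")
```

Sorting the tokens makes word order irrelevant, while the character-level ratio still rewards any letters the two strings happen to share.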
For Tools, the only properties that contribute to semantic equivalence are tool_types and name, with weights of 20 and 80, respectively. In this example, the two tools have the same values for properties that contribute to semantic equivalence but one has an additional, non-contributing property.
In [10]:
from stix2 import Tool
t1 = Tool(
    tool_types=["remote-access"],
    name="VNC",
)
t2 = Tool(
    tool_types=["remote-access"],
    name="VNC",
    description="This is a tool",
)
print(env.semantically_equivalent(t1, t2))
Out[10]:
For Vulnerabilities, the only properties that contribute to semantic equivalence are name and external_references, with weights of 30 and 70, respectively. In this example, the two vulnerabilities have the same name but one also has an external reference. The algorithm doesn't take into account any semantic equivalence contributing properties that are not present on both objects.
In [11]:
from stix2 import Vulnerability
vuln1 = Vulnerability(
    name="Heartbleed",
    external_references=[
        {
            "url": "https://example",
            "source_name": "some-source",
        },
    ],
)
vuln2 = Vulnerability(
    name="Heartbleed",
)
print(env.semantically_equivalent(vuln1, vuln2))
Out[11]:
In [12]:
print(env.semantically_equivalent(ind1, vuln1))
Some object types do not have a defined method for calculating semantic equivalence and by default will give a warning and a result of zero.
In [13]:
from stix2 import Report
r1 = Report(
    report_types=["campaign"],
    name="Bad Cybercrime",
    published="2016-04-06T20:03:00.000Z",
    object_refs=["indicator--a740531e-63ff-4e49-a9e1-a0a3eed0e3e7"],
)
r2 = Report(
    report_types=["campaign"],
    name="Bad Cybercrime",
    published="2016-04-06T20:03:00.000Z",
    object_refs=["indicator--a740531e-63ff-4e49-a9e1-a0a3eed0e3e7"],
)
print(env.semantically_equivalent(r1, r2))
Out[13]:
By default, comparing objects of different spec versions will result in a ValueError.
In [14]:
from stix2.v20 import Identity as Identity20
id20 = Identity20(
    name="John Smith",
    identity_class="individual",
)
print(env.semantically_equivalent(id2, id20))
You can optionally allow comparisons across spec versions by providing a configuration dictionary that sets ignore_spec_version, as in the next example:
In [15]:
from stix2.v20 import Identity as Identity20
id20 = Identity20(
    name="John Smith",
    identity_class="individual",
)
print(env.semantically_equivalent(id2, id20, **{"_internal": {"ignore_spec_version": True}}))
Out[15]:
In [16]:
import logging
logging.basicConfig(format='%(message)s')
logger = logging.getLogger()
logger.setLevel(logging.DEBUG)
ta3 = ThreatActor(
    threat_actor_types=["crime-syndicate"],
    name="Evil Org",
    aliases=["super-evil"],
)
ta4 = ThreatActor(
    threat_actor_types=["spy"],
    name="James Bond",
    aliases=["007"],
)
print(env.semantically_equivalent(ta3, ta4))
logger.setLevel(logging.ERROR)
Out[16]:
You can also retrieve the detailed results in a dictionary so that they can be accessed and used programmatically. The semantically_equivalent() function takes an optional third argument, called prop_scores. This argument should be a dictionary into which the detailed debugging information will be stored.
Using prop_scores is simple: pass a dictionary to semantically_equivalent(), and after the function has finished executing, the dictionary will contain the various scores. Specifically, it will have the overall matching_score and sum_weights, along with the weight and contributing score for each of the semantic equivalence contributing properties.
For example:
In [17]:
ta5 = ThreatActor(
    threat_actor_types=["crime-syndicate", "spy"],
    name="Evil Org",
    aliases=["super-evil"],
)
ta6 = ThreatActor(
    threat_actor_types=["spy"],
    name="James Bond",
    aliases=["007"],
)
prop_scores = {}
print("Semantic equivalence score using standard weights: %s" % (env.semantically_equivalent(ta5, ta6, prop_scores)))
print(prop_scores)
for prop in prop_scores:
    # The two summary keys hold plain numbers; every other key holds a per-property dictionary
    if prop not in ["matching_score", "sum_weights"]:
        print("Prop: %s | weight: %s | contributing_score: %s" % (prop, prop_scores[prop]['weight'], prop_scores[prop]['contributing_score']))
    else:
        print("%s: %s" % (prop, prop_scores[prop]))
Out[17]:
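Once prop_scores has been filled in, it can be post-processed like any other dictionary. The sketch below uses a hand-written dictionary with made-up numbers in the shape described above (a weight and contributing_score per property, plus the overall matching_score and sum_weights) and converts each contribution into points on the 0 to 100 scale. The helper name summarize_contributions and all sample values are hypothetical.

```python
def summarize_contributions(prop_scores):
    """Express each property's contribution as points out of 100."""
    total = prop_scores["sum_weights"]
    summary = {}
    for prop, detail in prop_scores.items():
        if prop in ("matching_score", "sum_weights"):
            continue  # skip the two overall totals
        summary[prop] = detail["contributing_score"] / total * 100.0
    return summary


# Made-up values for illustration only; not real library output.
sample_prop_scores = {
    "name": {"weight": 60, "contributing_score": 15.0},
    "threat_actor_types": {"weight": 20, "contributing_score": 10.0},
    "aliases": {"weight": 20, "contributing_score": 0.0},
    "matching_score": 25.0,
    "sum_weights": 100.0,
}
contribution_summary = summarize_contributions(sample_prop_scores)
```

A breakdown like this makes it easy to see which property is pulling a score up or down before deciding whether to adjust the weights.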
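If you wish, you can customize semantic equivalence comparisons. Specifically, you can do any of three things: provide custom weights for the contributing properties, provide custom comparison functions for individual properties, or provide a custom semantic equivalence function for an entire object type.
weights dictionary
In order to do any of these (optional) custom comparisons, you will need to provide a weights dictionary as the last parameter to the semantically_equivalent() method call, unpacked as keyword arguments (**weights).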
The weights dictionary should contain both the weight and the comparison function for each property. You may use the default weights and functions, or provide your own.
For reference, here is a list of the comparison functions already built in the codebase (found in stix2/environment.py):
For instance, if we wanted to compare two of the ThreatActors from before, but use our own weights, then we could do the following:
In [18]:
weights = {
    "threat-actor": {  # You must specify the object type
        "name": (30, stix2.environment.partial_string_based),  # Each property's value must be a tuple
        "threat_actor_types": (50, stix2.environment.partial_list_based),  # The 1st component must be the weight
        "aliases": (20, stix2.environment.partial_list_based),  # The 2nd component must be the comparison function
    }
}
print("Using standard weights: %s" % (env.semantically_equivalent(ta5, ta6)))
print("Using custom weights: %s" % (env.semantically_equivalent(ta5, ta6, **weights)))
Out[18]:
In [19]:
prop_scores = {}
weights = {
    "threat-actor": {
        "name": (45, stix2.environment.partial_string_based),
        "threat_actor_types": (10, stix2.environment.partial_list_based),
        "aliases": (45, stix2.environment.partial_list_based),
    },
}
env.semantically_equivalent(ta5, ta6, prop_scores, **weights)
print(prop_scores)
Out[19]:
In [20]:
def my_string_compare(p1, p2):
    # Exact match only: 1 if the strings are identical, otherwise 0
    if p1 == p2:
        return 1
    else:
        return 0

weights = {
    "threat-actor": {
        "name": (45, my_string_compare),
        "threat_actor_types": (10, stix2.environment.partial_list_based),
        "aliases": (45, stix2.environment.partial_list_based),
    },
}
print("Using custom string comparison: %s" % (env.semantically_equivalent(ta5, ta6, **weights)))
Out[20]:
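Any function that takes two property values and returns a number between 0 and 1 can serve as a comparison function in the same way. As one more illustration, here is a hypothetical case-insensitive variant of the exact-match comparison above. The function name is made up for this sketch and is not part of the library.

```python
def case_insensitive_compare(p1, p2):
    """Return 1.0 when the strings match ignoring case, otherwise 0.0."""
    return 1.0 if p1.casefold() == p2.casefold() else 0.0


matched = case_insensitive_compare("Evil Org", "EVIL ORG")
unmatched = case_insensitive_compare("Evil Org", "James Bond")
```

It could be placed in the tuple for the name property in the same position that my_string_compare occupies in the weights dictionary above.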
You can also customize the comparison of an entire object type instead of just how each property is compared. To do this, provide a weights dictionary to semantically_equivalent() and in this dictionary include a key of "method" whose value is your custom semantic equivalence function for that object type.
If you provide your own custom semantic equivalence method, you must also provide the weights for each of the properties (unless, for some reason, your custom method is weights-agnostic). However, since you are writing the custom method, your weights need not necessarily follow the tuple format specified in the above code box.
Note also that if you want detailed results with prop_scores you will need to implement that in your custom function, but you are not required to do so.
In this next example we use our own custom semantic equivalence function to compare two ThreatActors, and do not support prop_scores.
In [21]:
def custom_semantic_equivalence_method(obj1, obj2, **weights):
    sum_weights = 0
    matching_score = 0
    # Compare name
    w = weights['name']
    sum_weights += w
    contributing_score = w * stix2.environment.partial_string_based(obj1['name'], obj2['name'])
    matching_score += contributing_score
    # Compare aliases only for spies
    if 'spy' in obj1['threat_actor_types'] + obj2['threat_actor_types']:
        w = weights['aliases']
        sum_weights += w
        contributing_score = w * stix2.environment.partial_list_based(obj1['aliases'], obj2['aliases'])
        matching_score += contributing_score
    return matching_score, sum_weights

weights = {
    "threat-actor": {
        "name": 60,
        "aliases": 40,
        "method": custom_semantic_equivalence_method,
    }
}
print("Using standard weights: %s" % (env.semantically_equivalent(ta5, ta6)))
print("Using a custom method: %s" % (env.semantically_equivalent(ta5, ta6, **weights)))
Out[21]:
You can also write custom functions for comparing objects of your own custom types. Like in the previous example, you can use the built-in functions listed above to help with this, or write your own. In the following example we define semantic equivalence for our new x-foobar object type. Notice that this time we have included support for detailed results with prop_scores.
In [22]:
def _x_foobar_checks(obj1, obj2, prop_scores, **weights):
    matching_score = 0.0
    sum_weights = 0.0
    # Only score a property when it is present on both objects
    if stix2.environment.check_property_present("name", obj1, obj2):
        w = weights["name"]
        sum_weights += w
        contributing_score = w * stix2.environment.partial_string_based(obj1["name"], obj2["name"])
        matching_score += contributing_score
        prop_scores["name"] = (w, contributing_score)
    if stix2.environment.check_property_present("color", obj1, obj2):
        w = weights["color"]
        sum_weights += w
        contributing_score = w * stix2.environment.partial_string_based(obj1["color"], obj2["color"])
        matching_score += contributing_score
        prop_scores["color"] = (w, contributing_score)
    prop_scores["matching_score"] = matching_score
    prop_scores["sum_weights"] = sum_weights
    return matching_score, sum_weights

prop_scores = {}
weights = {
    "x-foobar": {
        "name": 60,
        "color": 40,
        "method": _x_foobar_checks,
    },
    "_internal": {
        "ignore_spec_version": False,
    },
}
foo1 = {
    "type": "x-foobar",
    "id": "x-foobar--0c7b5b88-8ff7-4a4d-aa9d-feb398cd0061",
    "name": "Zot",
    "color": "red",
}
foo2 = {
    "type": "x-foobar",
    "id": "x-foobar--0c7b5b88-8ff7-4a4d-aa9d-feb398cd0061",
    "name": "Zot",
    "color": "blue",
}
print(env.semantically_equivalent(foo1, foo2, prop_scores, **weights))
print(prop_scores)
Out[22]: