Arabesque notebook demo

Current Version: 1.0.2-BETA

Arabesque is a distributed graph mining system that enables quick and easy development of graph mining algorithms, while providing a scalable and efficient execution engine running on top of Hadoop.

Benefits of Arabesque:

  • Simple and intuitive API, especially tailored for graph mining algorithms.
  • Transparent handling of all the complexities associated with these algorithms.
  • Scalable to hundreds of workers.
  • Efficient implementation: negligible overhead compared to equivalent centralized solutions.

Arabesque is open source under the Apache 2.0 license.

Execution engine

This demo runs on Spark's execution engine, one of the alternatives for playing around with Arabesque. For more details about the supported execution engines, please refer to our project on GitHub.

Your algorithm is a computation

Every Arabesque application is defined in terms of its configuration. When setting up an application, the user must create a computation class that represents a graph mining algorithm. The interface for defining a computation is the following:

public interface Computation<E extends Embedding> {
    void init();
    void initAggregations();
    // ... //
    void process(E embedding);
    // ... //
    boolean shouldExpand(E newEmbedding);
    // ... //
}

Above we highlight only the main methods that must be implemented in order to create a computation, such as MotifsComputation.
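To make the contract concrete, here is a minimal, self-contained sketch of a toy computation. The `Embedding` stand-in and the `CountingComputation` class below are illustrative assumptions made for this sketch, not Arabesque's real classes; an actual computation would implement the full interface shipped with the system.

```java
// Illustrative stand-in for Arabesque's Embedding type (an assumption, not the real API).
interface Embedding {
    int getNumVertices();
}

// The interface shown above, reproduced locally so this sketch compiles on its own.
interface Computation<E extends Embedding> {
    void init();
    void initAggregations();
    void process(E embedding);
    boolean shouldExpand(E newEmbedding);
}

// A toy computation: counts every embedding it processes and stops
// expanding embeddings once they reach 3 vertices.
class CountingComputation implements Computation<Embedding> {
    long processed = 0;

    public void init() { processed = 0; }
    public void initAggregations() { /* aggregations would be registered here */ }
    public void process(Embedding e) { processed++; }
    public boolean shouldExpand(Embedding e) { return e.getNumVertices() < 3; }
}

public class ComputationSketch {
    public static void main(String[] args) {
        CountingComputation comp = new CountingComputation();
        comp.init();
        comp.process(() -> 2);                          // a 2-vertex embedding
        comp.process(() -> 3);                          // a 3-vertex embedding
        System.out.println(comp.processed);             // 2
        System.out.println(comp.shouldExpand(() -> 3)); // false: reached max size
    }
}
```

The engine drives these callbacks for you: `process` is invoked once per explored embedding, and `shouldExpand` prunes the exploration.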

Configuring a computation

In order to run a computation, you must now instantiate it through a Configuration object or one of its subclasses. When running interactively over Spark's execution engine (as we are doing here), the configuration class must be SparkConfiguration.

The following snippet shows how to configure a MotifsComputation. You can also explore other examples here.

In [4]:
import io.arabesque.conf.SparkConfiguration

val config = new SparkConfiguration

// set which class implements the computation
config.set("computation", "io.arabesque.examples.motif.MotifComputation")

// graph input path: it can be local or HDFS
val localPath = s"${System.getenv("ARABESQUE_HOME")}/data/mico-qanat-sortedByDegree-same-label.txt"
config.set("input_graph_path", localPath)

// tell the system whether it must fetch the graph from the local file system (file://) or HDFS (hdfs://)
config.set("input_graph_local", true)

// specific to the motif computation: the deepest level of embedding exploration
config.set("arabesque.motif.maxsize", 3)

[sparkConf, mainGraphClass=null, embeddingClass=null, computationClass=null]

Computation created and configured ... Let's execute it

We can now create a Spark execution engine by passing the SparkContext and the configuration (see above) to it, call compute(), and read the aggregation results with getAggregatedValue:

In [5]:
import io.arabesque.computation.SparkMasterExecutionEngine
import io.arabesque.aggregation.AggregationStorage

// pass the SparkContext and the configuration from the previous step to the execution engine
val engine = new SparkMasterExecutionEngine(sc, config)

// call compute() to run the computation
engine.compute()

// once it is done, you can observe the aggregated results by requesting an AggregationStorage by name.
// See MotifsComputation to learn how to do it
engine.getAggregatedValue[AggregationStorage[_,_]]("motifs")

AggregationStorage{name='motifs', keyValueMap={[0,1-1,1], [1,1-2,1], [0,1-2,1]=12534960, [1,1-2,1], [0,1-2,1]=53546459}}
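The returned storage is essentially a named map from pattern keys to aggregated values (here, motif pattern to occurrence count). As a rough mental model only, and not Arabesque's actual implementation, an aggregation that sums contributions per key could be sketched like this (all names below are hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of a key -> value aggregation that sums contributions per key.
// This is an illustrative sketch, not Arabesque's AggregationStorage.
class ToyAggregationStorage {
    final String name;
    final Map<String, Long> keyValueMap = new HashMap<>();

    ToyAggregationStorage(String name) { this.name = name; }

    // Each worker contributes (pattern, count) pairs; values for the same
    // pattern are reduced by summation.
    void aggregate(String pattern, long count) {
        keyValueMap.merge(pattern, count, Long::sum);
    }

    public static void main(String[] args) {
        ToyAggregationStorage motifs = new ToyAggregationStorage("motifs");
        motifs.aggregate("triangle", 10);
        motifs.aggregate("wedge", 25);
        motifs.aggregate("triangle", 5);  // merged with the earlier contribution
        System.out.println(motifs.keyValueMap.get("triangle")); // 15
        System.out.println(motifs.keyValueMap.get("wedge"));    // 25
    }
}
```

In the real system this reduction happens across distributed workers before you read the final storage by name.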

Arabesque also offers a more concise API through ArabesqueContext:

In [20]:
import io.arabesque.ArabesqueContext

val arab = new ArabesqueContext(sc)
val arabGraph = arab.textFile("file:///home/viniciusvdias/environments/Arabesque/data/citeseer-single-label.graph")
val motifs = arabGraph.motifs(3)