This article presents a few options for copying graphs between clusters. Most options can also be used to clone a graph as a copy onto the same cluster.
Lesser-known implementation details, such as the multiple keyspaces per graph and graph indexes, add to the complexity of copying a graph database based on DSE Graph, versions 5.1 to 6.7. Some easy options may be applicable, with some caveats.
Option 1: Use graph.io
This option is only suitable for small graphs with fewer than 10,000 vertices or edges. It does not work well with meta-properties.
If your graph is small and has no meta-properties, exporting the graph with graph.io and reimporting it with graph.io is a really simple method.
Side note: meta-properties are properties of properties. They are defined within a propertyKey definition; check for the properties keyword there, like here:
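For illustration, a propertyKey with a meta-property might look like this in the Gremlin schema (the names are made up for this example):

```groovy
//'source' is a meta-property attached to the 'name' property
schema.propertyKey('source').Text().single().create()
schema.propertyKey('name').Text().single().properties('source').create()
```

If you see a .properties(...) clause like this anywhere in your schema, this option is not a good fit for your graph.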
On cluster A, in the Gremlin console, do the following (assuming your graph is called "test"):
//set an alias to your graph
:remote config alias g test.g
//enable full graph scans
schema.config().option('graph.allow_scan').set('true')
//write your graph to file
graph.io(graphson()).writeGraph('/tmp/test.gson')
On cluster B, create a new graph with the same schema, and copy your gson file to the new cluster. The copy graph does not need to have the same name as the original graph.
You can then import your gson file into the new graph with graph.io like here:
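The import mirrors the export. As a sketch, assuming the copy graph is called "test2" and the file was copied to /tmp/test.gson on the new cluster:

```groovy
//alias the new graph, then read the file written on cluster A
:remote config alias g test2.g
graph.io(graphson()).readGraph('/tmp/test.gson')
```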
Check if indexes have been created properly after the graph copy. If not, the easiest way is to drop all indexes from the graph schema and add them again.
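One way to check is to compare the schema output, including the index definitions, on both clusters:

```groovy
//prints the full graph schema, including index definitions
schema.describe()
```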
Option 2: Export your graph with Spark and DSEGraphFrames
This is a good method if analytics is enabled on your cluster and the graph is not too big, since you are exporting all vertices and edges into files.
The export itself is easy, but you will then need to copy the data from dsefs to your local filesystem, transfer it to the new cluster, and import it into dsefs there before you can write it into the copy graph.
For the export:
In Spark shell, set a graphframe to your graph (replace with your graphname):
val g = spark.dseGraph("test")
g.V.write.json("/tmp/v_json")
g.E.write.json("/tmp/e_json")
This will export the vertex and edge data into separate dsefs directories, and you can concatenate the data like here:
dse fs -cat /tmp/v_json/* > local_vertices.json
dse fs -cat /tmp/e_json/* > local_edges.json
To copy the graph, you would need to create a graph with the same schema on cluster B, copy the data over to the dsefs on cluster B, and then load it from the Spark shell:
val g = spark.dseGraph("test2")
g.updateVertices(spark.read.json("/tmp/local_vertices.json"))
g.updateEdges(spark.read.json("/tmp/local_edges.json"))
Again, as in the example above, the copy graph does not need to have the same name as the original graph.
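The transfer between the clusters can be sketched like this (the paths and host name are illustrative, and the put syntax is assumed to mirror the cat style used above; the files must end up in dsefs on cluster B so the Spark executors can read them):

```shell
# on cluster A: pull the concatenated files out of dsefs (as above)
dse fs -cat /tmp/v_json/* > local_vertices.json
dse fs -cat /tmp/e_json/* > local_edges.json
# copy them to a node on cluster B, e.g. with scp
scp local_vertices.json local_edges.json user@clusterB-node:/tmp/
# on cluster B: upload into dsefs so the import can read them
dse fs -put /tmp/local_vertices.json /tmp/local_vertices.json
dse fs -put /tmp/local_edges.json /tmp/local_edges.json
```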
Option 3: Backup and restore with OpsCenter
This is a good option for large graphs, but only if the source graph and the copy can have the same name. It is not suitable for graph copies onto the same cluster, where you would need to change the graph name.
Take a backup of your graph to a location outside of the cluster, such as S3, and then restore your graph from this backup to cluster B.
Since the topology of cluster A is likely to differ from cluster B, you will need to first create a new graph on cluster B with exactly the same name as the backed-up graph. The name has to match, or you won't be able to restore. You need to create the graph, but you do not need to create the schema before the restore.
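Creating the (initially empty) graph on cluster B is a one-liner in the Gremlin console; the graph name here must match the backed-up graph exactly:

```groovy
//system commands run outside a graph alias
system.graph('test').create()
```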
You can then restore the graph from backup. Important: You need to select the option to truncate tables for the restore.
If you clone a graph this way, you will need to recreate any indexes, as they are not created automatically when restoring into an existing keyspace. To recreate them, drop the indexes from the Gremlin schema and then re-add them; this will start the indexing process. Note that reindexing your data and rebuilding materialized views takes time and resources.
You might have to do a rolling restart after the restore to be sure that the Gremlin schema is fully loaded.
Note: While you can create the graph schema before the restore, it is not required. In particular, it is advisable not to create the indexes in the graph schema before attempting the restore: concurrent streaming with sstableloader, Solr re-indexing, and the building of materialized views can overwhelm the restore process.
Option 4: Manual restore from snapshot
This is a manual option with no automation of the restore or copy process, and it therefore offers the most flexibility.
OpsCenter uses sstableloader for restores, and sstableloader can also be used for a manual restore without OpsCenter. For the manual restore, take a snapshot of both graph keyspaces on your source cluster, then copy all the sstables over to your target cluster.
Create a new graph with the same schema on your target cluster; the name of the graph does not need to match. Before you load the sstables with sstableloader, you will need to truncate all tables in the two keyspaces that belong to your graph.
Note: You will need to fully restore all tables of both keyspaces, or the graph might have schema issues and will not be traversable.
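As a rough sketch of the manual restore, assuming the two graph keyspaces are called test and test_system (the placeholders in angle brackets are illustrative):

```shell
# on the source cluster: snapshot both graph keyspaces
nodetool snapshot -t graph_copy test test_system
# copy the snapshot sstables to the target cluster; after truncating the
# target tables, stream each table directory with sstableloader.
# sstableloader derives the target keyspace and table from the last two
# path components, so rename the directories if the copy graph has a
# different name on the target cluster.
sstableloader -d <target_node_ip> /path/to/sstables/test/<table_name>
```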