Bulk adding choices to Cantemo Portal metadata fields from generic text files

For a while now I’ve been asked if there is an easy way to mass-populate metadata fields in Portal with a list of choices. In the back of my mind I thought it would be relatively simple to write a parser to do this, and I’ve done just that. It can be downloaded here. I’ll use this article to walk through how I did it so you can understand a bit more about Portal’s metadata field structure.

Right now, for the sake of our script, Portal really only has four “types” of fields.

    Special Fields
    Non editable (system) fields
    Editable fields that do not contain lists of choices
    Editable fields that contain lists of choices

So let’s look at these field classes.

Special Fields

    Integer – can only contain whole numbers (e.g. 2, 5, 1232)
    Date – can only contain a specifically formatted date string
    Timestamp – can only contain a specifically formatted date/time string
    Float – can only contain fractional/float numbers (e.g. 2.3, 1.0, 543.683)
    Tags – a mix of a choice-style field and a free-text field, user extensible
    Hierarchy – contains a full relational tree with multiple linked values in a single field

These fields all have very specific requirements and we won’t be using them in a list population script.

Non editable (system) fields

    System field – These are fields specific to Portal that are searchable, but not editable
    XMP field – Embedded metadata values scraped from XMP/EXIF/DC/IPTC values

Since these fields aren’t editable, we won’t be putting any lists into them.

Editable fields that do not contain lists of choices

    Text
    Textarea

These fields contain strings without special formatting unless a regex is applied in settings. Since we are building lists, we are going to ignore these field types.

Editable fields that contain lists of choices

    Multi-choice – selection of elements from a dropdown, can have only one selection at a time
    Checkbox – selection of elements from a visual array of checkboxes, can have multiple values
    Radio – selection of elements from a visual array with a radio button, can have only one selection at a time
    Lookup – selection of elements from a dropdown, can have multiple values
    Worksteps – used for process flow but values are still list populated, can have only one selection at a time

These are the fields that we are concerned with for the purpose of this script. There is one important thing to note here – for some reason, Lookup fields store their data in a different way and use a different API call to store lists. This isn’t the end of the world, but we do have to differentiate between two “types” of fields inside this class, and we have to sniff which type of field we are working with so we know which API call to make.
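To make that concrete, here is a minimal sketch of the decision (the helper name is mine, not from the script; the endpoint paths are the ones the script uses later):

```python
def choose_endpoint(field_name, field_type):
    # lookup fields store their choices behind the /values sub-resource;
    # every other list-type field round-trips the whole field document
    if field_type == 'lookup':
        return 'metadata-field/' + field_name + '/values'
    return 'metadata-field/' + field_name
```

The script below makes the same choice inline with an if/else rather than a helper, but the branching is identical.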

So now that we understand which fields are which, let’s dive into the script a little.

First thing we are going to do is import the libraries we need to make this all work.

import json
import requests
import argparse
import csv
import os 
import unicodedata
import re

Next, we build our CLI options using the argparse library. This gives us the ability to parse command-line arguments in any order by defining flags before them. It also provides a simple help menu and allows us to mark some arguments as required.

#command line argument parsing
parser = argparse.ArgumentParser(description='Batch update values in Portal metadata fields')
parser.add_argument('-u','--username',dest='username',metavar="USERNAME",type=str,help="Portal username, uses \"admin\" if not set",default="admin")
parser.add_argument('-p','--password',dest='password',metavar="PASSWORD",type=str,help="Portal password",required=True)
parser.add_argument('-a','--address',dest='address',metavar="ADDRESS",type=str,help="IP Address or DNS name of Portal server",required=True)
parser.add_argument('-f','--field',dest='field',metavar="FIELD",help="Portal metadata field",required=True)
parser.add_argument('-i','--input-file',metavar="FILE_PATH",dest='input_file',help="Key/Value input file (line delimited values or csv key/value pairs)",required=True)
cli_args = parser.parse_args()

Next up, we are just going to build a shortcut so we don’t have to type our Portal URL in every REST call for the rest of the script.

#format our URL as a shortcut
portal_url = 'http://' + cli_args.address + ':8080/API/'

After that, we are going to build a list of field types that this script works on. From our research above, we’ve come up with dropdown, checkbox, radio, tags, workstep, and lookup field types. You’ll notice these field types aren’t necessarily named the same thing as they are in the Metadata Manager UI. To figure out their programmatic names, I had to do some API calls on all the metadata fields to get their programmatic keys (more on that later).

#define which field types this script works on
safe_field_types = ['dropdown','checkbox','radio','tags','workstep','lookup']

Next up, we define some functions to make our lives easier. First is the slugify function, which simply makes strings safe for use as keys in the system. As of Portal 3.0, all keys are slugified when added via the UI, so I wanted to mimic that functionality.

def slugify(s):
	#slugify function to remove special characters from value strings
	s = s.lower()
	for c in [' ', '-', '.', '/']:
		s = s.replace(c, '_')
	s = re.sub(r'\W', '', s)
	s = s.replace('_', ' ')
	s = re.sub(r'\s+', ' ', s)
	s = s.strip()
	s = s.replace(' ', '-')
	return s
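As a quick sanity check, here is what the function produces for a couple of made-up labels (the function is repeated so this snippet runs on its own):

```python
import re

def slugify(s):
    # same logic as above: lowercase, strip non-word characters, hyphenate
    s = s.lower()
    for c in [' ', '-', '.', '/']:
        s = s.replace(c, '_')
    s = re.sub(r'\W', '', s)
    s = s.replace('_', ' ')
    s = re.sub(r'\s+', ' ', s)
    s = s.strip()
    s = s.replace(' ', '-')
    return s

print(slugify("Review & Approve"))  # review-approve
print(slugify("In Process"))        # in-process
```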

This is where things get interesting. To figure out what kind of field I’m working with, I need to get all the information on that field. The field itself is a Vidispine object, so we need to look at the Vidispine API docs to get all the field data back. In this case, we call :8080/API/metadata-field/&lt;field-name&gt;/?data=all with application/json headers. In my test environment, here is an example of what I’d get when running this on a field.

{
    "data": [
        {
            "key": "extradata",
            "value": "{\"sortable\": false, \"name\": \"Review and Approve\", \"default\": \"In Process\", \"pattern\": null, \"description\": null, \"cascading\": false, \"readonly\": null, \"values\": [{\"value\": \"In Process\", \"key\": \"In Process\"}, {\"value\": \"Review\", \"key\": \"Review\"}, {\"value\": \"Approved\", \"key\": \"Approved\"}, {\"value\": \"Distribution\", \"key\": \"Distribution\"}], \"autoset\": false, \"externalid\": null, \"reusable\": null, \"type\": \"workstep\", \"Film\": {\"required\": true, \"hideifnotset\": false, \"representative\": false}, \"vertical_mode\": false}"
        }
    ],
    "defaultValue": "In Process",
    "name": "portal_mf116390",
    "origin": "VX",
    "stringRestriction": {},
    "type": "string-exact"
}

So there is something interesting to note above – we have a data block, but in there is a key/value pair where the key is extradata. Inside the value of that key is an entire JSON object that has been stringified. So we have to get that data out and then objectify it so we can figure out what its values are and what type of field it is. Enter the get_field_data function.
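Here is that two-level parse in miniature, using a trimmed-down version of the document above:

```python
import json

# trimmed-down version of the field document shown above
doc = {
    "data": [
        {
            "key": "extradata",
            # note: the value is itself a JSON string, not an object
            "value": '{"type": "workstep", "values": [{"key": "In Process", "value": "In Process"}]}'
        }
    ],
    "name": "portal_mf116390"
}

# a second json.loads turns the stringified payload into a real object
extradata = json.loads(doc['data'][0]['value'])
print(extradata['type'])              # workstep
print(extradata['values'][0]['key'])  # In Process
```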

def get_field_data(fieldname):
	#get all information about a field
	r = requests.get(portal_url + 'metadata-field/' + fieldname,headers={'accept':'application/json'},params={'data':'all'},auth=(cli_args.username,cli_args.password))
	for d in r.json().get('data',[]):
		if d['key'] == 'extradata':
			return json.loads(str(d['value']))
	#no extradata key found in the data block
	return []

Our for loop here looks at the response JSON we got and loops through all the keys looking for extradata. If it finds the key, it returns a JSON object of the value for our extradata key; if it isn’t found, it returns an empty list.

Because I’m slightly lazy, had already written and implemented the above function, and needed parent information from the original JSON, I wrote a second function to get ALL the data back. The computational load of making two calls on the field is negligible in any use case I can see for this script, but the *right* way to do it would have been to parse the object returned from this function instead of using our get_field_data function. This simply gets the JSON document from Vidispine that we uncovered above into an object in the script. More on why that matters later.

def get_field_document(fieldname):
	#get all information about a field
	r = requests.get(portal_url + 'metadata-field/' + fieldname,headers={'accept':'application/json'},params={'data':'all'},auth=(cli_args.username,cli_args.password))
	return r.json()

Since lookups store data in a different way than all other list type fields, we needed a way to get the list of key/value pairs back for them as well in our processing. This function does just that.

def get_lookup_values(fieldname):
	#return json object of all key/value pairs of a field for lookups.  If the field values list is empty, just return an empty list
	r = requests.get(portal_url + 'metadata-field/' + fieldname + '/values',headers={'accept':'application/json'},auth=(cli_args.username,cli_args.password))
	try:
		return r.json()['field']		
	except:
		return []

Now that we have information about our fields and the values held within them, we need a way to parse our CSV or text files so we can update the fields. This function checks that the file exists, reads it in as a CSV file, determines how each line is formatted, creates a key/value pair for it, and adds it to a list of dicts. It then returns a unicode list of key/value pair dicts.

def get_file_values(input_file):
	#parse the input file into a list of key/value pairs for the field
	if os.path.exists(input_file):
		try:
			with open(input_file, 'r') as csvfile:
				options_obj = csv.reader(csvfile, delimiter=',', quotechar='"')
				values = []
				for row in options_obj:
					if len(row) == 2:
						#csv line: explicit key/value pair
						values.append({"key":slugify(row[0]),"value":row[1].rstrip()})
					elif len(row) == 1:
						#plain line: use the value as its own key
						values.append({"key":slugify(row[0]),"value":row[0].rstrip()})
					elif len(row) == 0:
						#skip blank lines
						continue
					else:
						print "Too many columns in CSV file, each line should have at most two"
						exit()
		except Exception as e:
			print "Error trying to parse input file: " + str(e)
			exit()
	else:
		print "File " + input_file + " doesn't exist"
		exit()
	return unicode_list(values)
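As a sketch of the two accepted input formats, the same branching can be exercised against an in-memory file (the sample lines are made up, and I’ve left out the slugify step for brevity):

```python
import csv
import io

# one-column lines become their own key; two-column lines are key,value
sample = u"1080p\n4K,Ultra HD (4K)\n"

values = []
for row in csv.reader(io.StringIO(sample)):
    if len(row) == 2:
        values.append({"key": row[0], "value": row[1]})
    elif len(row) == 1:
        values.append({"key": row[0], "value": row[0]})

print(values)
```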

The last function we had to write was a way to convert existing lists to unicode objects so we had uniformity when we start mixing/matching lists later. This function just takes a list with string encoded values and converts them all to unicode encoded values.

def unicode_list(pairs):
	#convert all key/value strings in a list of dicts to unicode
	new_list = []
	for pair in pairs:
		unidict = dict((k.decode('utf8'), v.decode('utf8')) for k, v in pair.items())
		new_list.append(unidict)
	return new_list

Now we get into the meat of the script – actually doing stuff. It looks like a lot, but there are really only three things happening. First, we run our functions to get all of the values and document objects from the field we are looking to update.

#get all the data about a field
field_data = get_field_data(cli_args.field)
field_document = get_field_document(cli_args.field)

There are basically two quick checks that happen: one to make sure the field type we’ve got is one we can process, and if that is true, one to see whether it is a lookup field or not (since the values are stored differently in lookups vs. other fields).

#check if our initial call returned good data
if field_data:
	#check if our field is an acceptable field
	if field_data['type'] in safe_field_types:
		#check what type of field and process

Now we only have two options – first is what happens if it is not a lookup field. The first thing we do is concatenate the field’s current values with the values from the file. Then we sort that set and remove duplicates. We then format the data so that it can be sent in a REST call.

Next, we have to figure out what index our extradata key is at. 99.9% of the time this will be 0, but it isn’t safe to assume, so we check. Then we replace the value for that key with our newly formatted data. Lastly, we make a REST call to post our JSON object back with all the new values and make sure it worked.
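The sort-and-deduplicate trick deserves a closer look: dicts aren’t hashable, so each one is converted to a sorted tuple of its items, pushed through a set, then rebuilt. A standalone sketch with made-up values:

```python
# dicts can't go into a set directly, so convert each one to a hashable
# sorted tuple of items, dedupe via set(), then rebuild the dicts
new_values = [
    {"key": "review", "value": "Review"},
    {"key": "approved", "value": "Approved"},
    {"key": "review", "value": "Review"},   # duplicate, will be dropped
]
sorted_set = [dict(t) for t in set(tuple(sorted(d.items())) for d in new_values)]
print(len(sorted_set))  # 2
```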

		if field_data['type'] != 'lookup':
			# get back options for non lookup fields
			new_values = get_file_values(cli_args.input_file) + field_data['values']
			#add existing values to new text file values, sort, and remove duplicates
			sorted_set = [dict(t) for t in set([tuple(sorted(d.items())) for d in new_values])]	
			#format data for posting back			
			field_data['values'] = sorted_set								
			#find our extradata index in the data object
			extradata_index = next(index for (index, d) in enumerate(field_document['data']) if d['key'] == 'extradata')
			#update that index's value with our new updated field data
			field_document['data'][extradata_index]['value'] = json.dumps(field_data)
			#put back data to field	
			r = requests.put(portal_url + 'metadata-field/' + cli_args.field, headers={'accept':'application/json','content-type':'application/json'},data=json.dumps(field_document),auth=(cli_args.username,cli_args.password))			
			r.raise_for_status()

Otherwise, we know we are working with a lookup field. The process is similar here, but our list data has to be formatted as XML and posted back in a different way. You can see the comments in this block follow the same path; the only difference is formatting our response and how/where we send it.

		else:
			# get back options for lookup fields		
			new_values = get_file_values(cli_args.input_file) + get_lookup_values(cli_args.field)
			#add existing values to new text file values, sort, and remove duplicates
			sorted_set = [dict(t) for t in set([tuple(sorted(d.items())) for d in new_values])]		
			#format data for posting back
			lookup_data = {'field':sorted_set}
			#format as XML since VS is broken for posting JSON right now
			metadata_doc = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?><SimpleMetadataDocument xmlns="http://xml.vidispine.com/schema/vidispine">'
			for pair in lookup_data['field']:
				metadata_doc = metadata_doc + '<field><key>' + pair['key'] + '</key><value>' + pair['value'] + '</value></field>'
			metadata_doc = metadata_doc + '</SimpleMetadataDocument>'
			#put back data to field	
			r = requests.put(portal_url + 'metadata-field/' + cli_args.field + '/values', headers={'accept':'application/json','content-type':'application/xml'},data=metadata_doc,auth=(cli_args.username,cli_args.password))
			r.raise_for_status()
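One caveat with the string concatenation above: a value containing an XML special character (&amp;, &lt;, &gt;) would produce an invalid document. A safer variant of the same loop uses the standard library’s escaping (the helper name here is mine, not from the script):

```python
from xml.sax.saxutils import escape

def build_values_doc(pairs):
    # same SimpleMetadataDocument as above, but with XML-escaped content
    doc = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    doc += '<SimpleMetadataDocument xmlns="http://xml.vidispine.com/schema/vidispine">'
    for pair in pairs:
        doc += '<field><key>' + escape(pair['key']) + '</key>'
        doc += '<value>' + escape(pair['value']) + '</value></field>'
    doc += '</SimpleMetadataDocument>'
    return doc

print(build_values_doc([{'key': 'r-and-a', 'value': 'Review & Approve'}]))
```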

The only things left are the closures of our if statements that check whether the field we are working with is valid.

	else:
		print "Can't use this field type with this script. Exiting."
		exit()
else:
	print "Error finding field " + cli_args.field
	exit()

Hopefully this gives you a little insight into how metadata values are stored in a Portal database and how you can manipulate them.