For a while now I’ve been asked if there is an easy way to mass populate metadata fields with a list of choices in Portal. In the back of my mind I thought it would be relatively simple to write a parser to do this, and I’ve done just that. It can be downloaded here. I’ll use this article to walk through how I did it so you can understand a bit more about Portal’s metadata field structure.
Right now, for the sake of our script, Portal really only has 4 “types” of fields:

- Special Fields
- Non editable (system) fields
- Editable fields that do not contain lists of choices
- Editable fields that contain lists of choices

So let’s look at these field classes.
Special Fields

- Integer – can only contain whole numbers (ex: 2, 5, 1232)
- Date – can only contain a specifically formatted date string
- Timestamp – can only contain a specifically formatted date/time string
- Float – can only contain fractional/float numbers (ex: 2.3, 1.0, 543.683)
- Tags – a mix of a choice-style field and a free text field, user extensible
- Hierarchy – contains a full relational tree with multiple linked values in a single field
These fields all have very specific requirements and we won’t be using them in a list population script.
Non editable (system) fields

- System field – fields specific to Portal that are searchable, but not editable
- XMP field – embedded metadata values scraped from XMP/EXIF/DC/IPTC values

Since these fields aren’t editable, we won’t be putting any lists into them.
Editable fields that do not contain lists of choices

- Text
- Textarea

These fields contain strings without special formatting unless a regex is applied in settings. Since we are building lists, we are going to ignore these field types.
Editable fields that contain lists of choices

- Multi-choice – selection of elements from a dropdown; can have only one selection at a time
- Checkbox – selection of elements from a visual array of checkboxes; can have multiple values
- Radio – selection of elements from a visual array of radio buttons; can have only one selection at a time
- Lookup – selection of elements from a dropdown; can have multiple values
- Worksteps – used for process flow, but values are still list-populated; can have only one selection at a time
These are the fields we are concerned with for the purpose of this script. There is one important thing to note here – for some reason, Lookup fields store their data in a different way and use a different API call to store lists. This isn’t the end of the world, but it means we have to differentiate between two “types” of fields inside this class, and we have to sniff which type we are working with so we know which API call to make.
So now that we understand which fields are which, let’s dive into the script a little.
First thing we are going to do is import the libraries we need to make this all work.
```python
import json
import requests
import argparse
import csv
import os
import unicodedata
import re
```
Next, we build our CLI options using the argparse library. This gives us the ability to easily parse command line arguments in any order by defining flags before them. It also provides a simple help menu and allows us to mark some arguments as required.
```python
# command line argument parsing
parser = argparse.ArgumentParser(description='Batch update values in Portal metadata fields')
parser.add_argument('-u', '--username', dest='username', metavar='USERNAME', type=str,
                    help='Portal username, uses "admin" if not set', default='admin')
parser.add_argument('-p', '--password', dest='password', metavar='PASSWORD', type=str,
                    help='Portal password', required=True)
parser.add_argument('-a', '--address', dest='address', metavar='ADDRESS', type=str,
                    help='IP Address or DNS name of Portal server', required=True)
parser.add_argument('-f', '--field', dest='field', metavar='FIELD',
                    help='Portal metadata field', required=True)
parser.add_argument('-i', '--input-file', metavar='FILE_PATH', dest='input_file',
                    help='Key/Value input file (line delimited values or csv key/value pairs)',
                    required=True)
cli_args = parser.parse_args()
```
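As a quick standalone sanity check (not part of the script itself), you can feed `parse_args` a fake argv to see that flags work in any order and that the username default kicks in; the sample values here are mine:

```python
import argparse

parser = argparse.ArgumentParser(description='Batch update values in Portal metadata fields')
parser.add_argument('-u', '--username', dest='username', default='admin')
parser.add_argument('-p', '--password', dest='password', required=True)
parser.add_argument('-a', '--address', dest='address', required=True)
parser.add_argument('-f', '--field', dest='field', required=True)
parser.add_argument('-i', '--input-file', dest='input_file', required=True)

# flags can come in any order; -u is omitted, so the default applies
args = parser.parse_args(['-f', 'portal_mf116390', '-p', 'secret',
                          '-i', 'values.csv', '-a', 'portal.example.com'])
assert args.username == 'admin'
assert args.field == 'portal_mf116390'
```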
Next up, we are just going to build a shortcut so we don’t have to type our Portal URL in every REST call for the rest of the script.
```python
# format our URL as a shortcut
portal_url = 'http://' + cli_args.address + ':8080/API/'
```
After that, we are going to build a list of field types that this script works on. From our research above, we’ve come up with dropdown, checkbox, radio, tags, workstep, and lookup field types. You’ll notice these field types aren’t necessarily named the same thing as they are in the Metadata Manager UI. To figure out their programmatic names, I had to make API calls against all the metadata fields to get their programmatic keys (more on that later).
```python
# define which field types this script works on
safe_field_types = ['dropdown', 'checkbox', 'radio', 'tags', 'workstep', 'lookup']
```
Next up we define some functions to make our lives easier. First up is the slugify function. This simply makes strings safe for key entry in the system. As of Portal 3.0, all keys are slugified when added via the UI, so I wanted to mimic that functionality.
```python
def slugify(s):
    # slugify function to remove special characters from value strings
    s = s.lower()
    for c in [' ', '-', '.', '/']:
        s = s.replace(c, '_')
    s = re.sub(r'\W', '', s)
    s = s.replace('_', ' ')
    s = re.sub(r'\s+', ' ', s)
    s = s.strip()
    s = s.replace(' ', '-')
    return s
```
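To see what that does in practice, here is a standalone copy of the function run against a couple of sample inputs (the sample strings are mine, not from Portal):

```python
import re

def slugify(s):
    # same function as in the script, repeated so this snippet runs standalone
    s = s.lower()
    for c in [' ', '-', '.', '/']:
        s = s.replace(c, '_')
    s = re.sub(r'\W', '', s)   # drop anything that isn't alphanumeric or underscore
    s = s.replace('_', ' ')
    s = re.sub(r'\s+', ' ', s) # collapse runs of whitespace
    s = s.strip()
    s = s.replace(' ', '-')
    return s

assert slugify('In Process') == 'in-process'
assert slugify('Review & Approve') == 'review-approve'
```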
This is where things get interesting. To figure out what kind of field I’m working with, I need to get all the information on that field. The field itself is a Vidispine object, so we need to look at the Vidispine API docs to see how to get the field data back. In this case, we make a GET request to `:8080/API/metadata-field/<fieldname>` with `application/json` accept headers. In my test environment, here is an example of what I’d get when running it on a field.
```json
{
    "data": [
        {
            "key": "extradata",
            "value": "{\"sortable\": false, \"name\": \"Review and Approve\", \"default\": \"In Process\", \"pattern\": null, \"description\": null, \"cascading\": false, \"readonly\": null, \"values\": [{\"value\": \"In Process\", \"key\": \"In Process\"}, {\"value\": \"Review\", \"key\": \"Review\"}, {\"value\": \"Approved\", \"key\": \"Approved\"}, {\"value\": \"Distribution\", \"key\": \"Distribution\"}], \"autoset\": false, \"externalid\": null, \"reusable\": null, \"type\": \"workstep\", \"Film\": {\"required\": true, \"hideifnotset\": false, \"representative\": false}, \"vertical_mode\": false}"
        }
    ],
    "defaultValue": "In Process",
    "name": "portal_mf116390",
    "origin": "VX",
    "stringRestriction": {},
    "type": "string-exact"
}
```
So there is something interesting to note above – we have a `data` block, but inside it is a key/value pair where the key is `extradata`. The value of that key is an entire JSON object that has been stringified. So we have to get that string out and then objectify it so we can figure out what its values are and what type of field it is. Enter the `get_field_data` function.
```python
def get_field_data(fieldname):
    # get all information about a field
    r = requests.get(portal_url + 'metadata-field/' + fieldname,
                     headers={'accept': 'application/json'},
                     params={'data': 'all'},
                     auth=(cli_args.username, cli_args.password))
    for d in r.json().get('data', []):
        if d['key'] == 'extradata':
            return json.loads(d['value'])
    return []
```
Our for loop here looks at the response JSON we got and runs through all the keys looking for `extradata`. If it finds the key, it returns the value of our `extradata` key parsed into a JSON object; if it isn’t found, it returns an empty list.
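The double-decode here (JSON inside a JSON string) is easy to get wrong, so here is a minimal standalone sketch against a mocked response body (the mock document is mine, modeled on the output shown earlier):

```python
import json

# a trimmed-down mock of the metadata-field response shown earlier;
# note the inner object is stored as a *string* inside "value"
response = {
    "data": [
        {"key": "extradata",
         "value": json.dumps({"type": "workstep",
                              "values": [{"key": "Review", "value": "Review"}]})}
    ]
}

# same loop shape as get_field_data: find the extradata key, then
# parse its string value into a real object
extradata = []
for d in response['data']:
    if d['key'] == 'extradata':
        extradata = json.loads(d['value'])
        break

assert extradata['type'] == 'workstep'
assert extradata['values'][0]['key'] == 'Review'
```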
Because I’m slightly lazy, had already written and implemented the above function, and needed parent information from the original JSON, I wrote a second function to get ALL the data back as well. The computational load of making two calls on the field is negligible in any use case I can see for this script, but the *right* way to do it would have been to parse the returned object from this function instead of also using our `get_field_data` function. This simply pulls the JSON document from Vidispine that we uncovered above into an object in the script. More on why that matters later.
```python
def get_field_document(fieldname):
    # get all information about a field
    r = requests.get(portal_url + 'metadata-field/' + fieldname,
                     headers={'accept': 'application/json'},
                     params={'data': 'all'},
                     auth=(cli_args.username, cli_args.password))
    return r.json()
```
Since lookups store data in a different way than all other list type fields, we needed a way to get the list of key/value pairs back for them as well in our processing. This function does just that.
```python
def get_lookup_values(fieldname):
    # return a JSON object of all key/value pairs of a lookup field
    # if the field's values list is empty, just return an empty list
    r = requests.get(portal_url + 'metadata-field/' + fieldname + '/values',
                     headers={'accept': 'application/json'},
                     auth=(cli_args.username, cli_args.password))
    try:
        return r.json()['field']
    except (ValueError, KeyError):
        return []
```
Now that we have information about our fields and the values held within them, we needed a way to parse our CSV or text files to easily update the fields. This function checks that the file exists, reads it in as a CSV file, then for each line checks how it is formatted, creates a key/value pair, and adds it to a list of dicts. It then returns a unicode list of key/value pair dicts.
```python
def get_file_values(input_file):
    # return a list of key/value pairs parsed from the input file (for non lookups)
    if os.path.exists(input_file):
        try:
            with open(input_file, 'r') as csvfile:
                options_obj = csv.reader(csvfile, delimiter=',', quotechar='"')
                values = []
                # this isn't a lookup, so we formulate JSON key/value pairs
                for row in options_obj:
                    if len(row) == 2:
                        values.append({"key": slugify(row[0]), "value": row[1].rstrip()})
                    elif len(row) == 1:
                        values.append({"key": slugify(row[0]), "value": row[0].rstrip()})
                    else:
                        print "Too many options for CSV file, should have two per line"
                        exit()
        except Exception as e:
            print "Error trying to parse input file: " + str(e)
            exit()
    else:
        print "File " + input_file + " doesn't exist"
        exit()
    return unicode_list(values)
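The two accepted line formats can be demonstrated without touching the filesystem, since `csv.reader` accepts any iterable of strings. This is a simplified sketch that skips the slugify step, with sample lines of my own:

```python
import csv

# one line with an explicit key/value pair, one line with only a value
lines = ['in_process,In Process', 'approved']

values = []
for row in csv.reader(lines):
    if len(row) == 2:
        # key and value given explicitly
        values.append({"key": row[0], "value": row[1].rstrip()})
    elif len(row) == 1:
        # only a value given, so it doubles as the key
        values.append({"key": row[0], "value": row[0].rstrip()})

assert values == [{"key": "in_process", "value": "In Process"},
                  {"key": "approved", "value": "approved"}]
```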
The last function we had to write converts existing lists to unicode objects so we have uniformity when we start mixing and matching lists later. It takes a list of dicts with string-encoded values and converts them all to unicode-encoded values.
```python
def unicode_list(pairs):
    # convert each key/value pair in the list to unicode
    new_list = []
    for pair in pairs:
        unidict = dict((k.decode('utf8'), v.decode('utf8')) for k, v in pair.items())
        new_list.append(unidict)
    return new_list
```
Now we get into the meat of the script – actually doing stuff. It looks like a lot, but there are really only three things happening. First, we run our functions to get all of the values and document objects from the field we are looking to update.
```python
# get all the data about a field
field_data = get_field_data(cli_args.field)
field_document = get_field_document(cli_args.field)
```
There are basically two quick checks that happen: one to make sure the field type we’ve got is one we can process, and if that is true, one to see whether it is a lookup field or not (since the values are stored differently in lookups vs. other fields).
```python
# check if our initial call returned good data
if field_data:
    # check if our field is an acceptable field
    if field_data['type'] in safe_field_types:
        # check what type of field and process
```
Now we only have two options. First is what happens if it is not a `lookup` field. The first thing we do is concatenate the values currently in the field with the values from the file. Then we sort that set and remove duplicates, and format the data so that it can be sent in a REST call.

Next, we have to figure out what index our `extradata` key is at. 99.9% of the time this will be 0, but it isn’t safe to assume, so we check. Then we replace the value for that key with our newly formatted data. Lastly, we make a REST call to put our JSON object back with all the new values and make sure it worked.
```python
        if field_data['type'] != 'lookup':
            # get back options for non lookup fields
            new_values = get_file_values(cli_args.input_file) + field_data['values']
            # add existing values to new text file values, sort, and remove duplicates
            sorted_set = [dict(t) for t in set([tuple(sorted(d.items())) for d in new_values])]
            # format data for posting back
            field_data['values'] = sorted_set
            # find our extradata index in the data object
            extradata_index = next(index for (index, d) in enumerate(field_document['data'])
                                   if d['key'] == 'extradata')
            # update that index's value with our new updated field data
            field_document['data'][extradata_index]['value'] = json.dumps(field_data)
            # put back data to field
            r = requests.put(portal_url + 'metadata-field/' + cli_args.field,
                             headers={'accept': 'application/json',
                                      'content-type': 'application/json'},
                             data=json.dumps(field_document),
                             auth=(cli_args.username, cli_args.password))
            r.raise_for_status()
```
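That deduplication one-liner is dense, so here is what it does to a small list on its own (the sample data is mine):

```python
new_values = [{"key": "red", "value": "Red"},
              {"key": "blue", "value": "Blue"},
              {"key": "red", "value": "Red"}]

# each dict becomes a hashable tuple of sorted items, the set drops
# duplicate tuples, and dict() turns each survivor back into a dict
deduped = [dict(t) for t in set([tuple(sorted(d.items())) for d in new_values])]

# set iteration order is arbitrary, so sort before comparing
deduped.sort(key=lambda d: d["key"])
assert deduped == [{"key": "blue", "value": "Blue"},
                   {"key": "red", "value": "Red"}]
```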
Otherwise, we know we are working with a lookup field. The process is similar here, but our list data has to be formatted as XML and posted back in a different way. The comments in this block follow the same path; the only differences are how we format our data and how/where we send it.
```python
        else:
            # get back options for lookup fields
            new_values = get_file_values(cli_args.input_file) + get_lookup_values(cli_args.field)
            # add existing values to new text file values, sort, and remove duplicates
            sorted_set = [dict(t) for t in set([tuple(sorted(d.items())) for d in new_values])]
            # format data for posting back
            lookup_data = {'field': sorted_set}
            # format as XML since VS is broken for posting JSON right now
            metadata_doc = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?><SimpleMetadataDocument xmlns="http://xml.vidispine.com/schema/vidispine">'
            for pair in lookup_data['field']:
                metadata_doc = metadata_doc + '<field><key>' + pair['key'] + '</key><value>' + pair['value'] + '</value></field>'
            metadata_doc = metadata_doc + '</SimpleMetadataDocument>'
            # put back data to field
            r = requests.put(portal_url + 'metadata-field/' + cli_args.field + '/values',
                             headers={'accept': 'application/json',
                                      'content-type': 'application/xml'},
                             data=metadata_doc,
                             auth=(cli_args.username, cli_args.password))
            r.raise_for_status()
```
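One caveat with building XML by string concatenation: a value containing &, &lt;, or &gt; would produce an invalid document. A safer variant of the same loop (my suggestion, not in the original script) runs each string through `xml.sax.saxutils.escape`:

```python
from xml.sax.saxutils import escape

# sample pairs of my own; note the ampersand in the value
pairs = [{"key": "rd", "value": "R&D"}]

metadata_doc = ('<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
                '<SimpleMetadataDocument xmlns="http://xml.vidispine.com/schema/vidispine">')
for pair in pairs:
    # escape() converts &, <, > into entities so the XML stays well-formed
    metadata_doc += ('<field><key>' + escape(pair['key']) + '</key>'
                     '<value>' + escape(pair['value']) + '</value></field>')
metadata_doc += '</SimpleMetadataDocument>'

assert '<value>R&amp;D</value>' in metadata_doc
```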
The only thing left is the closing of our if statements that check whether the field we are working with is valid.
```python
    else:
        print "Can't use this field type with this script. Exiting."
        exit()
else:
    print "Error finding field " + cli_args.field
    exit()
```
Hopefully this gives you a little insight into how metadata values are stored in a Portal database and how you can manipulate them.