22 December 2009

Names in Australian repositories: I'd say you want a revolution ...

Have you ever wondered how your colleagues manage the storage and display of author names in their repositories? Well, wonder no more! A few months ago the NicNames Project surveyed Australian repository managers to discover more about the way they store and display names in their repositories. The results are a snapshot of the metadata stored in Australian repositories, and we think they're really fascinating.

For starters, 50 percent of respondents say they record an author's name exactly as it appears on the publication:

Did you expect that? Given how many repository managers are librarians (and therefore schooled in authority control), plus how far we already distort our repositories to meet the requirements of ERA, I'm surprised that so many repository managers are not using either the HR name or another method of authority control.

Then again, perhaps that's because we only asked about the display of names in the repository. When we asked what other variants were being collected, the figures tipped a little:

Many people are wondering whether the NicNames Project is building a national authority file for researchers. My answer is no. That's a job for someone else. Our brief is to help you find practical ways to manage names in your repositories. And authority files are not practical for IRs. Here's why.

1. Think about what you need to build a traditional authority file. One of the first match points is date of birth. But it's generally not stored for authors in Australian repositories:

I'd love to know why. Is it that you feel it's inappropriate to record the date of birth for living people? Or would you like to record it but the data isn't available to more than 2 of you?

2. Between the absence of dates of birth and the growing trend towards recording authors' names as they appear on publications, it looks as though we're not storing much of what's expected for standard authority files. This sounds to me like resounding support for the idea that repositories are moving away from conventional attitudes of 'control' and 'authority' towards a more flexible idea of versions of names appearing within a particular context.

To give you an example, here's something that repositories store that other (more controlled) systems don't:

3. FORs may not be a perfect classification scheme, but they do provide a controlled vocabulary of Australasian research disciplines. And when they're read in conjunction with details about co-authors (recorded on every publication) and affiliation (recorded in over half of Australian repositories), they tell us a lot about a person's research identity.

And this may well be far more valuable to help us tell people apart in a scholarly publishing context than their dates of birth. Any thoughts, anyone?
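To make that idea a little more concrete, here is a minimal Python sketch that compares two author records by their FOR codes, co-authors and affiliation instead of a date of birth. The record layout, the field names, the sample FOR codes and the equal weighting of the three clues are all assumptions made for illustration; none of this is NicNames code.

```python
from dataclasses import dataclass, field


@dataclass
class AuthorRecord:
    """Hypothetical author record; the field names are illustrative only."""
    name: str
    for_codes: set = field(default_factory=set)   # Fields of Research codes
    coauthors: set = field(default_factory=set)
    affiliation: str = ""


def jaccard(a, b):
    """Overlap of two sets: 0.0 (nothing shared) to 1.0 (identical)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def context_similarity(x, y):
    """Average of FOR overlap, co-author overlap and affiliation match."""
    same_affiliation = 1.0 if x.affiliation and x.affiliation == y.affiliation else 0.0
    return (jaccard(x.for_codes, y.for_codes)
            + jaccard(x.coauthors, y.coauthors)
            + same_affiliation) / 3


# Made-up records: two name forms sharing context, and a namesake who shares none.
a = AuthorRecord("J Smith", {"0801", "0806"}, {"A Lee"}, "Swinburne")
b = AuthorRecord("Jane Smith", {"0801"}, {"A Lee", "B Wong"}, "Swinburne")
c = AuthorRecord("J Smith", {"1103"}, {"C Patel"}, "UNSW")
```

Here `context_similarity(a, b)` comes out at two thirds even though the name strings differ, while `context_similarity(a, c)` is zero even though the name strings are identical. A real system would need to weight and tune these clues; the point is only that context can separate people where names cannot.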

'You say you got a real solution, well you know ... we'd all love to see the plan'
- Lennon/McCartney

28 October 2009

How does your organisation differentiate between two people with the same name?

I was flying back to Melbourne after visiting the other NicNames partners last week, when a curiously topical thing happened to me on board the plane.

The airline had mistakenly given two passengers the same boarding pass, thereby allocating them the same seat (a physical impossibility). As they introduced themselves to the flight attendants, it became clear that both unfortunate passengers had exactly the same name - first and last. It wasn't a particularly common name, which made it quite a coincidence.

As the aircraft was entirely full, there was nowhere for one of the two same-named passengers to sit, so the flight was delayed for around 20 minutes while flight attendants and the second passenger walked up and down the aisles looking a bit stressed.

It's an example of the sort of thing that can go wrong when the only identifier you have for telling people apart is their name. The airline (or whoever printed up that second boarding pass for that "same" person) suffered from one of the two causes of problems NicNames aims to prevent: assuming that two dealings with people of the same name must involve the same person.

The two passengers presumably had a booking reference number, in addition to their name, to identify themselves to check-in staff (or machines). Presumably the mistake happened when someone looked up the second passenger by name, and found the other passenger's record, already with an allocated seat. They then went on to fill every other seat in the aircraft.

I don't know what happened to the extra passenger in the end - whether he got a free upgrade to business class, or was kicked off the plane. However, a similarity can be drawn to the experience of searching through citations in a repository only to find two people's work muddled in together under the same author heading.

19 October 2009

Problems with identity: why we need to be careful

There are many reasons why it's important to be able to match or disambiguate the names of people publishing in the scholarly literature. Some are administrative and involve better back-end management of names in institutional repositories. Some relate to users and how the display of name variants in repository interfaces can help their search or even confuse them further.

For researchers, there are a whole series of consequences of not managing publication names. For starters, when a database can't match J Smith and Jane Smith, citation counts and the metrics based on them become distorted. Citations belonging to a single person but distributed across name versions can be called 'split citation'.

Then there's 'mixed citation', which happens when work by two people with the same name is jumbled together. There's nothing worse than someone else taking credit for your masterpiece (or, for that matter, having to take the rap for someone else's ill-conceived ideas ...). I've just found a recent article from Nature that highlights a particularly dramatic case of 'mixed citation'.

Surgeon Liu Hui had a common name ... those of us with common names usually consider this a curse. But Dr Liu wasn't worried. In fact, he turned the ambiguity of his identity to his advantage. He added the publications of all the other Liu Huis he could find to his CV to make it look better. And it worked.

For those who believe this kind of academic fraud is always going to be found out, you're right. Liu was dismissed in 2006. But not before he became Assistant Dean at Tsinghua University on the back of his impressive publication record.

Moral of this story: name management is very, very important.
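For anyone who likes to see the arithmetic, the split and mixed citation problems described above can be sketched in a few lines of Python. The papers, citation counts and person identifiers below are invented; the identifiers simply stand in for whatever reliable key a system might hold.

```python
from collections import defaultdict

# Hypothetical data: (author name as printed, person id, citation count).
papers = [
    ("J Smith", "p1", 10),
    ("Jane Smith", "p1", 5),    # same person as above, different name form
    ("Liu Hui", "p2", 7),
    ("Liu Hui", "p3", 40),      # a different person with the same name
]


def totals_by(key_index):
    """Sum citation counts grouped by the chosen column of each row."""
    totals = defaultdict(int)
    for row in papers:
        totals[row[key_index]] += row[2]
    return dict(totals)


by_name = totals_by(0)    # what a name-only database sees
by_person = totals_by(1)  # what it would see with a reliable identifier
```

Grouping by name splits one person's 15 citations across two headings (split citation) and credits a single 'Liu Hui' with 47 citations that belong to two people (mixed citation). Grouping by identifier gives 15, 7 and 40.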

15 October 2009

Progress Report - 15 October 2009

We haven’t had any monthly progress reports in a while, so I have prepared a brief progress report to update everyone on the project status. As we move towards the last phase of the project, everybody has been working hard on the project outputs the team has defined for the NicNames project. The status of these outputs is listed below.

1. Project Plan
This has been finalised to reflect any changes to the project outcomes.

2. Review of global developments classified by possible use
A review has been carried out and an updated literature review report is being completed.

3. Stakeholder requirements analysis
Requirements of key stakeholders have been identified and documented.

4. Institutional analysis
Current methods of name authority at key institutions have been identified and documented.

5. Analysis of relevant schema and standards
Current and developing standards, schema and mapping relating to names have been analysed. A report on preferred schema, standards and mappings for the project is being completed.

6. System specification
Requirements for the prototype application and tools have been documented. These identify the functional requirements for the NicNames project, formally set out system use cases and define the agreed scope of work to meet the requirements.

7. Guidelines toolkit
A usability study has been completed, and the outcomes are being used to generate a set of procedures for dealing with personal names in institutional repositories. Documentation for the prototype application is being developed.

8. One or more open source applications/tools
Development of a prototype NicNames application and supporting tools has progressed well and a large part of the web interface has been completed.

9. Implementation plan
Site visits for the implementation of the prototype application at partner institutions have been scheduled for the week of 19/10/2009. A draft implementation plan has been prepared for the site visits.

10. Project evaluation report with recommendations for further action
11. Release Plan
The evaluation report and release plan will be formally prepared as we move further along in the final phase of the project.

08 October 2009

All quiet on the NicNames front?

It has been a little quiet over here lately. At the moment, I'm writing a revised literature review on names. The JISC landscape review was a great summary of the names environment in June 2008, but it has been a busy year in our area and we'd like to share some of the more interesting new literature with you as well.

The JISC Names Project released its Phase One final report in July. This partnership between the University of Manchester and the British Library is building a national authority file for the whole of the UK. It's an ambitious task, and we salute them for it. They've already released a prototype of their web service; you can have a play here (I did).

Also in July, Peter Sefton from the CAIRSS Project wrote a blog post about how a NicNames web service might interact with People Australia (I particularly liked the picture of the happy repository manager and hope that will be me soon ...)

The scholarly literature is also reflecting some very interesting developments. I summarised Dorothea Salo's paper on the absence of name authority control in institutional repositories in an earlier post. It's exciting to see that the big journals are starting to weigh in on the action, too. If 2008 will be remembered as the year The Lancet published an article about two clinical researchers who had decided to become numbers, 2009 was the year Science started to care about names. Both articles discussed the merits of the ResearcherID product from Thomson Reuters, which they described as 'ready and available now'. (I'm not so sure about that ...)

And finally, a few weeks ago, Ernesto Ruelas Inzunza from Dartmouth published what looks like a very interesting paper, 'Writing and citing "international" names'. As soon as I can get my hands on a copy, I'll let you know all about it.

Interested in more literature about names? Feel free to contact Rebecca.

18 September 2009

NicNames Project Plan
The draft project plan has been finalised to reflect any changes to the project outcomes as the project has progressed and the requirements have been refined, and to reflect the new completion dates of the project. This has been released as The ARROW NicNames Project Project Plan Version 1.1.

10 August 2009

Of Beagles and men: a cautionary tale of Charles Darwins

Are you one of the people who finds it difficult to see the problem we're trying to address with NicNames? C'mon, don't be shy ... I know you're out there.

Sure, researchers at our institutions publish work under a variety of name variants. But we know who they (really) are, so why not slap the same name on all their papers so they're easier for our users to find? It's how libraries do it.

Susan Stone from Intergalactic University might produce research as Sue Stone, S. Stone, S. G. Stone, S. Gilligan Stone and write horror novels as Susan Sly Stallone, but we could ignore all that untidiness and bring everything together under a single authoritative name. It would be so much neater.

Except that it's not. And I'm going to prove it.

Today, while I was adding some new records to Swinburne Research Bank, I noticed a familiar author name come up on a paper. Let's say the name was Charles Darwin so I don't have to give away any real names. And the artist to be known as Charles Darwin is familiar to me because he works at Swinburne's humanities faculty and has contributed lots of his work already (under the name Charles Darwin).

But the Charles Darwin I saw today was not Charles Darwin, Swinburne atheist. It was Charles Darwin from MIT, co-author of Emily Pankhurst, Swinburne professor of innovation.

Oh dear. So now what?

Here's how Swinburne Research Bank's author browse handles two completely different Charles Darwins spelled exactly the same way.


Would your repository be any different? I doubt it.

If all we have is their names, two Charles Darwins will always appear as the same person, both in search results and browse menus. And in a display like this, there's no way for our users to tell the difference between them.

We need to be able to record and present defining features---such as fields of research and institutional affiliations---to be able to make sense of these names. It's all about context. And this is where NicNames will come in handy.
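Here is a rough Python sketch of what presenting that context could look like in an author browse. The record fields and the research fields assigned to each author are made up for illustration, and this is not how Swinburne Research Bank or NicNames actually works.

```python
from collections import Counter

# Hypothetical browse records; only names that collide get extra context.
authors = [
    {"name": "Darwin, Charles", "affiliation": "Swinburne", "field": "Humanities"},
    {"name": "Darwin, Charles", "affiliation": "MIT", "field": "Innovation"},
    {"name": "Pankhurst, Emily", "affiliation": "Swinburne", "field": "Innovation"},
]


def browse_labels(records):
    """Append affiliation and field to any name that appears more than once."""
    counts = Counter(r["name"] for r in records)
    labels = []
    for r in records:
        if counts[r["name"]] > 1:
            labels.append(f'{r["name"]} ({r["affiliation"]}; {r["field"]})')
        else:
            labels.append(r["name"])
    return labels
```

With these records, `browse_labels(authors)` lists the two Charles Darwins as separate entries, each qualified by affiliation and field, and leaves Emily Pankhurst's entry unchanged.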

In the meantime, we at Swinburne Research Bank have a problem, and we won't be alone. I bet there's more than one Susan Smith out there. How are you going to answer when she knocks at your door?

--Rebecca Parker, NicNames Subject Matter Expert

Note: There are a few Easter eggs in the browse table. Be sure to let me know in the Comments if you find them.

12 June 2009

NicNames Valet interface etc


An example:

NicNames as a webservice

The current perspective is of NicNames as a webservice, supplying a set of APIs for submitting names and for extracting names and their associated metadata. The standard set of DB maintenance methods (Add, Edit, Reports etc) is supplied, together with extensions that tie the service to a web application (e.g. Valet) and resolve names and metadata that can be used to populate application fields. As such there are two forms of access: direct access to NicNames, or calls to NicNames made through an application such as Valet or a repository management application (e.g. VITAL). The latter requires customisation to integrate NicNames with the application.

Since the Valet environment covers self-submission of data to the repository, there are (a) some restrictions on access to NicNames methods and (b) requirements for repository staff to later validate name entries input from a Valet environment (the X-Files element - trust no one!). The security also covers harvesting, where NicNames data can be extracted (in OAI-PMH format) covering the name(s) and a defined set or subset of existing data. That definition is set down by the associated repository manager(s), and so limits access to data such as staff IDs, which are essential for disambiguation but could be exploited for identity theft.
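The access rules described above might be sketched roughly as follows. This is a toy in-memory stand-in written in Python, not the NicNames API: the class name, method names and field names are all hypothetical, and it returns plain dictionaries where the real service would produce OAI-PMH output.

```python
class NameService:
    """Toy in-memory stand-in for a name webservice (not the real NicNames API)."""

    def __init__(self, harvestable_fields):
        # Fields the repository manager allows harvesters to see.
        self.harvestable_fields = set(harvestable_fields)
        self.records = {}
        self._next_id = 1

    def add(self, name, validated=False, **metadata):
        """Store a name entry; self-submitted entries start unvalidated."""
        record_id = self._next_id
        self._next_id += 1
        self.records[record_id] = {"name": name, "validated": validated, **metadata}
        return record_id

    def validate(self, record_id):
        """Repository staff confirm an entry that came from self-submission."""
        self.records[record_id]["validated"] = True

    def resolve(self, name):
        """Return ids of all validated entries whose name matches exactly."""
        return [rid for rid, rec in self.records.items()
                if rec["name"] == name and rec["validated"]]

    def harvest(self):
        """Expose only the permitted subset of each validated record."""
        return [{k: v for k, v in rec.items() if k in self.harvestable_fields}
                for rec in self.records.values() if rec["validated"]]
```

In this sketch a staff ID can be stored and used internally, but it never leaves the service through `harvest()` unless the repository manager adds it to the permitted fields.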

As a webservice NicNames is predominantly passive: the DB is populated with existing names from the repository by repository staff extracting data into an XML file and submitting that file to NicNames. Once the names have been added, any additional data defined when setting up the NicNames schema must also be added to the system. The amount of data is determined by the repository manager; alternatively, one can accept the default schema, which is comprehensive in its coverage of data usable to aid the disambiguation process.
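As an illustration of the extract-to-XML step, here is a short Python sketch using the standard library. The element and attribute names are invented for the example; the actual NicNames import schema is not described in this post.

```python
import xml.etree.ElementTree as ET


def build_import_file(rows):
    """Serialise repository name rows into one XML string for bulk submission.

    Element and attribute names here are invented for illustration.
    """
    root = ET.Element("names")
    for row in rows:
        person = ET.SubElement(root, "person", {"localId": row["local_id"]})
        ET.SubElement(person, "displayName").text = row["name"]
        for variant in row.get("variants", []):
            ET.SubElement(person, "variant").text = variant
    return ET.tostring(root, encoding="unicode")


# Hypothetical row extracted from a repository.
xml_text = build_import_file([
    {"local_id": "42", "name": "Stone, Susan", "variants": ["Stone, S.", "Stone, Sue"]},
])
```

The resulting string could then be written to a file and submitted; how the submission itself happens is outside the scope of this sketch.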

The simplicity of the approach, i.e. a webservice that enables the 'transcending' of current authority control data (such as MARC format), hides the complexity of using that data to disambiguate names; the essential feature of NicNames is the speed and precision it achieves in disambiguation. A further benefit is access to the additional metadata for purposes beyond resolving ambiguities.

NicNames & Disambiguation - moving into higher dimensions

In considering disambiguation issues -

"It is a lot like the difference between solids, where the atoms are locked into place, and fluids, where the atoms tumble over one another at random. But right in between the two extremes, at a kind of abstract phase transition called the edge of chaos, you also find complexity: a class of behaviors in which the components of the system never quite lock into place, yet never quite dissolve into turbulence, either. These are the systems that are both stable enough to store information, and yet evanescent enough to transmit it. These are the systems that can be organized to perform complex computations, to react to the world, to be spontaneous, adaptive, and alive." M. Mitchell Waldrop, from Complexity [p. 293]

We are dealing with an area of structural engineering called plastic hinge theory:

Plastic hinge theory: http://en.wikipedia.org/wiki/Plastic_hinge

The emphasis is on the "plastic rotation [deformation] of an otherwise rigid column connection" - for us the 'rigid column connection' is the key, the identifier, for people in the form of a list of names. As such we are focused on the static/dynamic, the solid/fluid border of identity.

The use of Bayesian probabilities introduces a 'partials' perspective as we try to identify the 'whole', but it is still focused on a one-dimensional point of view. This issue is under consideration. At the same time, we are focused on the more practical implementation of a refined one-dimensional methodology, where the refinement takes the form of the NicNames metadata schema, which allows for extended analysis of name associations and so extends the authority control material currently used in the disambiguation process.
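To give a feel for what a Bayesian approach to names can look like, here is a generic naive Bayes update in Python. It is a textbook sketch, not the NicNames method: the function name, the prior and the likelihood figures are all assumptions, and the clues are treated as independent, which real metadata rarely is.

```python
def posterior_same_person(prior, evidence):
    """Update the probability that two name records refer to one person.

    `evidence` is a list of (p_if_same, p_if_different) pairs, one per
    observed clue, treated as independent (a naive Bayes assumption).
    """
    p_same, p_diff = prior, 1.0 - prior
    for p_if_same, p_if_different in evidence:
        p_same *= p_if_same
        p_diff *= p_if_different
    return p_same / (p_same + p_diff)


# Made-up likelihoods: a shared co-author, then a shared affiliation.
shared_coauthor = (0.8, 0.1)
shared_affiliation = (0.9, 0.3)
estimate = posterior_same_person(0.5, [shared_coauthor, shared_affiliation])
```

Starting from an even prior, the shared co-author alone lifts the estimate to about 0.89, and adding the shared affiliation lifts it to 0.96. Each extra piece of metadata in the schema is another potential clue of this kind.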

29 May 2009

‘Monthly’ Progress Report for May 2009

I see we haven’t posted an entry here since early March and no progress report since January(!). The excuse is that we have actually been getting on with it.

Because of the earlier delays, the NicNames project has now been extended until the end of October. This will give us time to complete the tasks we have set for ourselves and, hopefully, produce a set of applications and documentation which will not only describe the problem, but provide Institutional Repository managers with a way to deal with it.

We have a Business Requirements Specification and will soon have systems and application specifications. By the end of June we should have a working application and by about the end of August a completed usability study and a guidelines toolkit.

These products will then be implemented and tested in the partner institutions, Swinburne University of Technology, University of Newcastle and University of New South Wales, during September and October and once we are happy with the way it works, it will all be released to the wider IR community.

Stay tuned.

04 March 2009

The trouble with names is they belong to people

I recently read Dorothea Salo's latest article, 'Name authority control in institutional repositories', which will appear in the April issue of Cataloging and Classification Quarterly. (You can find the preprint here).

As a repository manager, Salo is aware that name authority problems have a significant impact, not just on librarians responsible for content management in repositories, but also on repository users and the discoverability of our content. She believes that one of the reasons for the problems we experience managing author names is that we never envisaged our institutional repositories as library-managed databases; they were meant to be 'do-it-yourself' (Salo 2009) author deposit mechanisms. This means we didn't plan how to control our author metadata in the first instance.

But even if we had, how would we have controlled it?

Traditional cataloguing standards like AACR2 are designed by librarians for librarians, and for library systems frankly more concerned with stock inventory than resource discovery. Authors have no input in the way their works are represented in a library catalogue; cataloguing standards treat them as just another piece of descriptive metadata.

Whether we populate our repositories through self-deposit or librarians recruiting content themselves, there's no doubt that authors are much more to IRs than just another metadata element.

For starters, without authors institutional repositories have no content, and without content, they don't exist. And the location of authors at the time they create a work is the sole basis for their inclusion in an institutional repository's collection.

Salo's viewpoint is that the problems with consistency in repository content are tied to software. But a quick glance at institutional repositories using a variety of software solutions shows that name variant problems affect them all. No single repository, regardless of architecture, can escape this issue, because it's not a software problem but a human one. And people are always much trickier than technology.

Salo believes that eventually institutional repository software will improve, and that '[i]n the meantime, institutional-repository managers can only plan to plow large amounts of staff time into managing names' (Salo 2009). But the truth is that it's not that easy. We've already spent inordinate amounts of time trying to find a way to manage author names in Swinburne Research Bank, and we've drawn a blank.

Salo notes that EPrints software (unlike DSpace and Fedora) has an autocomplete function, which allows depositors to select from names in the repository's existing vocabulary when they create author metadata. But this is not a long-term solution. While it might help with cases where authors use their initials on some papers and their full names on others (assuming we're comfortable with overwriting these differences---and that's a big assumption), it's just not appropriate when authors associate a different identity with a particular name variant (eg a married name, legal change of name, etc).
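A name autocomplete of the kind described above can be sketched in a few lines of Python. This is a generic prefix match written for illustration, with a hypothetical function name; it is not EPrints code.

```python
def suggest(existing_names, typed, limit=5):
    """Return up to `limit` existing names that start with what was typed."""
    prefix = typed.casefold()
    matches = [n for n in sorted(existing_names) if n.casefold().startswith(prefix)]
    return matches[:limit]


# Made-up vocabulary already held in the repository.
vocabulary = ["Stone, Susan", "Stone, S.", "Smith, Jane"]
```

Typing 'sto' offers both 'Stone, S.' and 'Stone, Susan'. The depositor has to pick one, which is exactly where a meaningful difference between name forms can be overwritten.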

Names are not just about software. They're about people.
And this is where NicNames comes in.

--Rebecca Parker, NicNames Subject Matter Expert


UPDATE: The article has now been published, and is available here.

30 January 2009

Draft specification for NicNames application

The following document is a rough attempt to describe the way that the NicNames application might work, from my perspective.

From a high level, it describes:
  • Data model
  • User interface
  • Query service
  • Bulk import or 'harvesting'
Some questions are as yet unanswered. All comments are welcome.

>> Nicnames spec 0.1.1 20090130 (PDF, 248KB)

23 January 2009

The logic of persistent identifiers

“Authority control is the process of grouping multiple terms for the same entity into a single record for the purposes of disambiguation and collocation” [1] and has a long history in the library world. But, because of that long history, some practices have accumulated which are not appropriate in a digital context.

In particular, the authorised (form of name) heading concept is an artefact of card catalogues, which was used as a mechanism to collocate entries for all works (or more precisely FRBR group 1 entities: Work, Expression, Manifestation, Item) by a named entity (more precisely a FRBR group 2 entity: a person or a corporate body), including those created under variant forms of name. See and See also entries (tracings) were then used to refer to the main entry authorised form.

The authorised form of name used in this way also, confusingly, conflates a particular name form with collocation.

In a digital environment, we don’t need an authorised form of name because any form of name can be used to link to all works by the named entity. But, we do need some form of persistent identifier (PID) to identify the group 2 entities to which the variant names and group 1 entities can be linked.

That PID could be in the form of a URI which links to information about the group 2 entity, but it should be noted that this again conflates two logically distinct functions: (a) providing a linking function between group 2 (named) entities, their names and works (group 1 entities) and (b) providing information about the group 2 entity.

In a local system, the PID could be as simple as any non-meaningful string (that is, one not linked to or derived from any data in the record), most likely numeric. As long as suitable policies [2], such as those developed by the PILIN project, are in place and resources are provided to implement them, such PIDs will work for local purposes.
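A minimal Python sketch of this local arrangement follows: each person gets an opaque numeric PID, and any recorded form of name leads to all of that person's works. The class, the function and the sample data are hypothetical and chosen only to illustrate the logic above.

```python
from dataclasses import dataclass, field
from itertools import count

_next_pid = count(1)  # opaque counter; the number encodes nothing about the person


@dataclass
class Person:
    """A group 2 entity: name variants and works hang off the PID."""
    pid: str = field(default_factory=lambda: str(next(_next_pid)))
    names: set = field(default_factory=set)
    works: list = field(default_factory=list)


def works_for_name(people, name):
    """Any recorded name form leads to every work of each matching person."""
    return {p.pid: p.works for p in people if name in p.names}


# Made-up data: two people who share one name form.
susan = Person(names={"Stone, Susan", "Stone, S. G."}, works=["Paper A", "Paper B"])
other = Person(names={"Stone, S. G."}, works=["Paper C"])
```

Searching for 'Stone, Susan' returns one PID with both of her works; searching for 'Stone, S. G.' returns two PIDs, so the two people stay distinct without either name form being 'authorised'.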

However, in a situation where there is a need to identify a group 2 entity beyond the local system, as is the case with the NicNames Project, a higher level PID is required. This is because we are now trying to link namedEntityA@Swin with namedEntityA@UNSW with namedEntityA@UNew. That is, an Australian researcher may have works deposited at any of a number of Australian research repositories and we want to be able to identify both the works and any authority data not held locally.

This is where an educational or national name identification service, such as the National Library of Australia's People Australia service, could play an important role.

If the first repository to generate authority data for a researcher submits it to People Australia, a PID could be assigned for that researcher which other repositories could then use when incorporating the authority data into their own systems. If works (group 1 entities) were also linked to the authority data, then, in principle, it should be possible to easily find all works by that researcher, in whatever repository they happen to reside.

The implications of this logic are that each repository creates authority data for new researchers as they deposit work into the repository. That authority data, including any attached works and any relevant entity attributes, is submitted to People Australia, which assigns a PID that is later added to the local record.

When the researcher changes institution and deposits material in that institution’s repository, the authority data is retrieved from People Australia and incorporated into the local system complete with the already assigned PID. The new work and any further attributes, such as the new affiliation, are then added to the authority data and resubmitted to People Australia.

It should then be possible, in principle, to incorporate a metasearching component into repository searches which will query People Australia to retrieve all works by a given researcher.
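The round trip described above can be sketched with a toy registry in Python. This is only a simulation of the logic: the class and its methods are hypothetical and do not represent People Australia's actual interface, and the PID format is invented.

```python
class NationalRegistry:
    """Toy stand-in for a national name service; not People Australia's real API."""

    def __init__(self):
        self._records = {}
        self._next = 1

    def submit(self, authority, pid=None):
        """Store authority data; assign a PID on first submission, merge on later ones."""
        if pid is None:
            pid = f"pid-{self._next}"
            self._next += 1
            self._records[pid] = {"names": set(), "affiliations": [], "works": []}
        record = self._records[pid]
        record["names"].update(authority.get("names", []))
        record["affiliations"].extend(authority.get("affiliations", []))
        record["works"].extend(authority.get("works", []))
        return pid

    def retrieve(self, pid):
        """Return everything held for a PID, wherever it was submitted from."""
        return self._records[pid]


registry = NationalRegistry()
# First repository creates the authority data and receives a PID.
pid = registry.submit({"names": ["Stone, Susan"], "affiliations": ["Swinburne"],
                       "works": [("Swinburne", "Paper A")]})
# After a move, the second repository resubmits under the same PID.
registry.submit({"names": ["Stone, S. G."], "affiliations": ["UNSW"],
                 "works": [("UNSW", "Paper B")]}, pid=pid)
```

A metasearch for this researcher then amounts to `registry.retrieve(pid)["works"]`, which lists works from both repositories.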

References

  1. Norrish, Jamie (2007). EATS: an entity authority tool set. http://researcharchive.vuw.ac.nz/handle/10063/220
  2. Nicholas, Nick, Ward, Nigel and Blinco, Kerry (2009). A policy checklist for enabling persistence of identifiers. D-Lib Magazine. 15 (1/2). http://www.dlib.org/dlib/january09/nicholas/01nicholas.html

09 January 2009

Progress Report January 2009

Happy New Year!

Most of the team is back at work this week after a break over the Christmas and New Year period and is pressing on with the project. Our Business Analyst, Damien Ingle, started just before Christmas and has spent some time with both Swinburne and Newcastle staff gathering information to feed into stakeholder requirements and institutional analyses.

Rebecca Parker, our Subject Matter Expert (because she actually works with the Swinburne Research Bank repository) is putting together a set of researcher personal name use cases and our Programmer, Tom Rutter, has begun development of tools to work with personal names.

Now that we have a better idea of costs, it seems as though there may be enough funding to take the project beyond the original March deadline or to increase the resources devoted to it over a shorter time frame. So, while we are still not being overly ambitious, it may be that we can do a little more than we had originally thought. Fingers crossed.