22 December 2009
Names in Australian repositories: I'd say you want a revolution ...
For starters, 50 percent of respondents say they record an author's name exactly as it appears on the publication:
28 October 2009
How does your organisation differentiate between two people with the same name?
19 October 2009
Problems with identity: why we need to be careful
For researchers, there are a whole series of consequences of not managing publication names. For starters, when a database can't match J Smith and Jane Smith, citation counts and the metrics based on them become distorted. Citations belonging to a single person but distributed across name versions can be called 'split citation'.
Then there's 'mixed citation', which happens when work by two people with the same name is jumbled together. There's nothing worse than someone else taking credit for your masterpiece (or, for that matter, having to take the rap for someone else's ill-conceived ideas ...). I've just found a recent article from Nature that highlights a particularly dramatic case of 'mixed citation'.
Surgeon Liu Hui had a common name ... those of us with common names usually consider this a curse. But Dr Liu wasn't worried. In fact, he turned the ambiguity of his identity to his advantage. He added the publications of all the other Liu Huis he could find to his CV to make it look better. And it worked.
For those who believe this kind of academic fraud is always going to be found out, you're right. Liu was dismissed in 2006. But not before he became Assistant Dean at Tsinghua University on the back of his impressive publication record.
Moral of this story: name management is very, very important.
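To make 'split' and 'mixed' citation concrete, here is a minimal sketch. The records and the "person" column are invented for illustration; the whole point is that a name-only database doesn't have that column.

```python
from collections import defaultdict

# Hypothetical records: (name as printed, true person, citations).
# A database that matches on names alone only sees the first and last columns.
records = [
    ("J Smith", "jane", 10),
    ("Jane Smith", "jane", 5),
    ("J Smith", "john", 7),
]

def counts_by(key_index):
    """Sum citations grouped by the chosen column of each record."""
    totals = defaultdict(int)
    for record in records:
        totals[record[key_index]] += record[2]
    return dict(totals)

by_name = counts_by(0)    # what a name-matching database reports
by_person = counts_by(1)  # what each researcher actually earned

print(by_name)
print(by_person)
```

Jane's citations end up split across "J Smith" and "Jane Smith", while the "J Smith" total mixes Jane's work with John's.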
15 October 2009
Progress Report - 15 October 2009
We haven’t had any monthly progress reports in a while, so I have prepared a brief progress report to update everyone on the project status. As we move towards the last phase of the project, everybody has been working hard on the project outputs the team has defined for the NicNames project. The status of these outputs is listed below.
1. Project Plan
This has been finalized to reflect any changes to the project outcomes.
2. Review of global developments classified by possible use
A review has been carried out and an updated literature review report is being completed.
3. Stakeholder requirements analysis
Requirements of key stakeholders have been identified and documented.
4. Institutional analysis
Current methods of name authority at key institutions have been identified and documented.
5. Analysis of relevant schema and standards
Current and developing standards, schema and mapping relating to names have been analyzed. A report on preferred schema, standards and mappings for the project is being completed.
6. System specification
Requirements for the prototype application and tools have been documented. These identify the functional requirements for the NicNames project, formally set out system use cases and define the agreed scope of work to meet the requirements.
7. Guidelines toolkit
A usability study has been completed, and the outcomes are being used to generate a set of procedures for dealing with personal names in institutional repositories. Documentation for the prototype application is being developed.
8. One or more open source applications/tools
Development of a prototype NicNames application and supporting tools has progressed well and a large part of the web interface has been completed.
9. Implementation plan
Site visits for the implementation of the prototype application at partner institutions have been scheduled for the week of
10. Project evaluation report with recommendations for further action
11. Release Plan
The evaluation report and release plan will be formally prepared as we move further along in the final phase of the project.
08 October 2009
All quiet on the NicNames front?
The JISC Names Project released its Phase One final report in July. This partnership between the University of Manchester and the British Library is building a national authority file for the whole of the UK. It's an ambitious task, and we salute them for it. They've already released a prototype of their web service; you can have a play here (I did).
Also in July, Peter Sefton from the CAIRSS Project wrote a blog post about how a NicNames web service might interact with People Australia (I particularly liked the picture of the happy repository manager and hope that will be me soon ...)
The scholarly literature is also reflecting some very interesting developments. I summarised Dorothea Salo's paper on the absence of name authority control in institutional repositories in an earlier post. It's exciting to see that the big journals are starting to weigh in on the action, too. If 2008 will be remembered as the year The Lancet published an article about two clinical researchers who had decided to become numbers, 2009 was the year Science started to care about names. Both articles discussed the merits of the ResearcherID product from Thomson Reuters, which they described as 'ready and available now'. (I'm not so sure about that ...)
And finally, a few weeks ago, Ernesto Ruelas Inzunza from Dartmouth published what looks like a very interesting paper, 'Writing and citing "international" names'. As soon as I can get my hands on a copy, I'll let you know all about it.
Interested in more literature about names? Feel free to contact Rebecca.
18 September 2009
The draft project plan has been finalised to reflect any changes to the project outcomes as the project has progressed and the requirements have been refined, and to reflect the new completion dates of the project. This has been released as The ARROW NicNames Project Project Plan Version 1.1.
10 August 2009
Of Beagles and men: a cautionary tale of Charles Darwins
Sure, researchers at our institutions publish work under a variety of name variants. But we know who they (really) are, so why not slap the same name on all their papers so they're easier for our users to find? It's how libraries do it.
Susan Stone from Intergalactic University might produce research as Sue Stone, S. Stone, S. G. Stone, S. Gilligan Stone and write horror novels as Susan Sly Stallone, but we could ignore all that untidiness and bring everything together under a single authoritative name. It would be so much neater.
Except that it's not. And I'm going to prove it.
Today, while I was adding some new records to Swinburne Research Bank, I noticed a familiar author name come up on a paper. Let's say the name was Charles Darwin so I don't have to give away any real names. And the artist to be known as Charles Darwin is familiar to me because he works at Swinburne's humanities faculty and has contributed lots of his work already (under the name Charles Darwin).
But the Charles Darwin I saw today was not Charles Darwin, Swinburne atheist. It was Charles Darwin from MIT, co-author with Emily Pankhurst, Swinburne professor of innovation.
Oh dear. So now what?
Here's how Swinburne Research Bank's author browse handles two completely different Charles Darwins spelled exactly the same way.
Would your repository be any different? I doubt it.
If all we have is their names, two Charles Darwins will always appear as the same person, both in search results and browse menus. And in a display like this, there's no way for our users to tell the difference between them.
We need to be able to record and present defining features---such as fields of research and institutional affiliations---to be able to make sense of these names. It's all about context. And this is where NicNames will come in handy.
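As a rough sketch of what context buys us, compare a browse list keyed on name alone with one keyed on name plus affiliation. The record fields and paper titles are made up for illustration, and this is not how Swinburne Research Bank or NicNames stores anything.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class AuthorRecord:
    # Hypothetical fields; a real repository schema will differ.
    name: str
    affiliation: str
    field_of_research: str
    title: str

records = [
    AuthorRecord("Charles Darwin", "Swinburne", "Humanities", "Paper A"),
    AuthorRecord("Charles Darwin", "Swinburne", "Humanities", "Paper B"),
    AuthorRecord("Charles Darwin", "MIT", "Innovation", "Paper C"),
]

def browse(records, key):
    """Group paper titles under whatever the key function returns."""
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record.title)
    return dict(groups)

by_name = browse(records, lambda r: r.name)                    # one merged entry
by_context = browse(records, lambda r: (r.name, r.affiliation))  # two entries

print(by_name)
print(by_context)
```

Affiliation on its own is not a safe key, because researchers change institutions; it simply shows how one extra piece of context separates the two Darwins.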
In the meantime, we at Swinburne Research Bank have a problem, and we won't be alone. I bet there's more than one Susan Smith out there. How are you going to answer when she knocks at your door?
--Rebecca Parker, NicNames Subject Matter Expert
Note: There are a few Easter eggs in the browse table. Be sure to let me know in the Comments if you find them.
12 June 2009
NicNames as a webservice
Because the Valet environment covers self-submission of data to the repository, there are (a) some restrictions on access to NicNames methods and (b) requirements for repository staff to later validate name entries that were input from a Valet environment (the X-Files element: trust no one!). The security also covers harvesting, where NicNames data can be extracted (in OAI-PMH format) as the name(s) plus a defined set or subset of existing data. The associated repository manager(s) define that set, which limits access to data such as staff IDs that are essential for disambiguation but could be exploited for identity theft.
As a webservice, NicNames is predominantly passive: the database is populated with existing names when repository staff extract data from the repository into an XML file and submit that file to NicNames. Once the names have been added, any additional data defined when the NicNames schema was set up must also be added to the system. The amount of data is determined by the repository manager; alternatively, one can accept the default schema, which comprehensively covers the data that can aid the disambiguation process.
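The XML export step described above might look something like the following sketch. The element names (names, person, displayName, variant, affiliation) are assumptions made up for this example; they are not the NicNames schema.

```python
import xml.etree.ElementTree as ET

def build_name_export(people):
    """Serialise repository name data to an XML string (hypothetical layout)."""
    root = ET.Element("names")
    for person in people:
        entry = ET.SubElement(root, "person")
        ET.SubElement(entry, "displayName").text = person["name"]
        # Variant forms and affiliation are optional extra data.
        for variant in person.get("variants", []):
            ET.SubElement(entry, "variant").text = variant
        if "affiliation" in person:
            ET.SubElement(entry, "affiliation").text = person["affiliation"]
    return ET.tostring(root, encoding="unicode")

xml_text = build_name_export([
    {
        "name": "Susan Stone",
        "variants": ["Sue Stone", "S. G. Stone"],
        "affiliation": "Intergalactic University",
    },
])
print(xml_text)
```

How much goes into each entry would be a decision for the repository manager, as noted above.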
The simplicity of the approach (a webservice that enables the 'transcending' of current authority control data, such as MARC format) hides the complexity of using that data to disambiguate names; the essential feature of NicNames is the speed and precision it achieves in disambiguation. A further benefit is access to the additional metadata for uses beyond resolving ambiguities.
NicNames & Disambiguation - moving into higher dimensions
"It is a lot like the difference between solids, where the atoms are locked into place, and fluids, where the atoms tumble over one another at random. But right in between the two extremes, at a kind of abstract phase transition called the edge of chaos, you also find complexity: a class of behaviors in which the components of the system never quite lock into place, yet never quite dissolve into turbulence, either. These are the systems that are both stable enough to store information, and yet evanescent enough to transmit it. These are the systems that can be organized to perform complex computations, to react to the world, to be spontaneous, adaptive, and alive." M. Mitchell Waldrop, from Complexity [p. 293]
We are dealing with an area of mathematics called 'hinge theory':
Plastic hinge theory is covered at http://en.wikipedia.org/wiki/Plastic_hinge
The emphasis is on the "plastic rotation [deformation] of an otherwise rigid column connection" - for us the 'rigid column connection' is the key, the identifier, for people in the form of a list of names. As such we are focused on the static/dynamic, the solid/fluid border of identity.
The use of Bayesian probabilities introduces a 'partials' perspective as we try to identify the 'whole', but it is still focused on a one-dimensional point of view. This issue is under consideration while we concentrate on the more practical implementation of a refined one-dimensional methodology. The refinement takes the form of the NicNames metadata schema, which allows extended analysis of name associations and so extends the authority control material currently used in the disambiguation process.
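For readers unfamiliar with the Bayesian idea, here is a generic sketch of how partial evidence can update a belief that two name strings refer to the same person. It is textbook Bayes' rule, not the NicNames algorithm, and every probability below is an invented, illustrative number. Applying the update twice also assumes the two pieces of evidence are independent.

```python
def posterior_same_person(prior, p_evidence_if_same, p_evidence_if_different):
    """Bayes' rule: probability two records are the same person given one piece of evidence."""
    numerator = prior * p_evidence_if_same
    denominator = numerator + (1 - prior) * p_evidence_if_different
    return numerator / denominator

# Start undecided about whether two "Charles Darwin" records match.
prior = 0.5

# Evidence 1 (illustrative likelihoods): both records share an affiliation.
after_affiliation = posterior_same_person(prior, 0.9, 0.1)

# Evidence 2 (illustrative likelihoods): both records share a field of research.
after_field = posterior_same_person(after_affiliation, 0.8, 0.3)

print(after_affiliation, after_field)
```

Each extra metadata element in the schema is another piece of evidence that can be fed through the same update.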
29 May 2009
‘Monthly’ Progress Report for May 2009
Because of the earlier delays, the NicNames project has now been extended until the end of October. This will give us time to complete the tasks we have set for ourselves and, hopefully, produce a set of applications and documentation which will not only describe the problem, but provide Institutional Repository managers with a way to deal with it.
We have a Business Requirements Specification and will soon have systems and application specifications. By the end of June we should have a working application and by about the end of August a completed usability study and a guidelines toolkit.
These products will then be implemented and tested in the partner institutions, Swinburne University of Technology, University of Newcastle and University of New South Wales, during September and October and once we are happy with the way it works, it will all be released to the wider IR community.
Stay tuned.
04 March 2009
The trouble with names is they belong to people
As a repository manager, Salo is aware that name authority problems have a significant impact, not just on librarians responsible for content management in repositories, but also on repository users and the discoverability of our content. She believes that one of the reasons for the problems we experience managing author names is that we never envisaged our institutional repositories as library-managed databases; they were meant to be 'do-it-yourself' (Salo 2009) author deposit mechanisms. This means we didn't plan how to control our author metadata in the first instance.
But even if we had, how would we have controlled it?
Traditional cataloguing standards like AACR2 are designed by librarians for librarians, and for library systems frankly more concerned with stock inventory than resource discovery. Authors have no input in the way their works are represented in a library catalogue; cataloguing standards treat them as just another piece of descriptive metadata.
Whether we populate our repositories through self-deposit or librarians recruiting content themselves, there's no doubt that authors are much more to IRs than just another metadata element.
For starters, without authors institutional repositories have no content, and without content, they don't exist. And the location of authors at the time they create a work is the sole basis for their inclusion in an institutional repository's collection.
Salo's viewpoint is that the problems with consistency in repository content are tied to software. But a quick glance at institutional repositories using a variety of software solutions shows that name variant problems affect them all. No single repository, regardless of architecture, can escape this issue, because it's a human problem, not a software one. And people are always much trickier than technology.
Salo believes that eventually institutional repository software will improve, and that '[i]n the meantime, institutional-repository managers can only plan to plow large amounts of staff time into managing names' (Salo 2009). But the truth is that it's not that easy. We've already spent inordinate amounts of time trying to find a way to manage author names in Swinburne Research Bank, and we've drawn a blank.
Salo notes that EPrints software (unlike DSpace and Fedora) has an autocomplete function, which allows depositors to select from names in the repository's existing vocabulary when they create author metadata. But this is not a long-term solution. While it might help with cases where authors use their initials on some papers and their full names on others (assuming we're comfortable with overwriting these differences---and that's a big assumption), it's just not appropriate when authors associate a different identity with a particular name variant (eg a married name, legal change of name, etc).
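To see why autocomplete only goes so far, here is a minimal sketch of prefix matching against an existing name vocabulary. It is a generic illustration, not EPrints code, and the names are borrowed from our fictional Susan Stone.

```python
def suggest_names(prefix, vocabulary, limit=5):
    """Return existing names that start with the typed prefix, ignoring case."""
    needle = prefix.casefold()
    matches = sorted(name for name in vocabulary if name.casefold().startswith(needle))
    return matches[:limit]

# Hypothetical vocabulary already held in the repository.
vocabulary = {
    "Stone, S.",
    "Stone, Susan",
    "Stone, Susan G.",
    "Stallone, Susan Sly",
}

suggestions = suggest_names("stone, s", vocabulary)
print(suggestions)
```

The depositor is offered three look-alike strings with nothing to say whether they are one person or three, and the pen name never appears at all.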
Names are not just about software. They're about people.
And this is where NicNames comes in.
--Rebecca Parker, NicNames Subject Matter Expert
UPDATE: The article has now been published, and is available here.
30 January 2009
Draft specification for NicNames application
From a high level, it describes:
- Data model
- User interface
- Query service
- Bulk import or 'harvesting'
>> Nicnames spec 0.1.1 20090130 (PDF, 248KB)
23 January 2009
The logic of persistent identifiers
“Authority control is the process of grouping multiple terms for the same entity into a single record for the purposes of disambiguation and collocation” [1] and has a long history in the library world. But, because of that long history, some practices have accumulated which are not appropriate in a digital context.
In particular, the authorised (form of name) heading concept is an artefact of card catalogues, which was used as a mechanism to collocate entries for all works (or more precisely FRBR group 1 entities: Work, Expression, Manifestation, Item) by a named entity (more precisely a FRBR group 2 entity: a person or a corporate body), including those created under variant forms of name. See and See also entries (tracings) were then used to refer to the main entry authorised form.
The authorised form of name used in this way also, confusingly, conflates a particular name form with collocation.
In a digital environment, we don’t need an authorised form of name because any form of name can be used to link to all works by the named entity. But, we do need some form of persistent identifier (PID) to identify the group 2 entities to which the variant names and group 1 entities can be linked.
That PID could be in the form of a URI which links to information about the group 2 entity, but it should be noted that this again conflates two logically distinct functions; that is, (a) providing a linking function between group 2 (named) entities, their names and works (group 1 entities) and (b) providing information about the group 2 entity.
In a local system, the PID could be as simple as any non-meaningful string (that is, one not linked to or derived from any data in the record), most likely numeric. As long as suitable policies [2], such as those developed by the PILIN project, are in place and resources are provided to implement them, such PIDs will work for local purposes.
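A minimal sketch of this local arrangement follows, assuming a simple counter as the source of non-meaningful identifiers. The class and function names are hypothetical and are not taken from the NicNames specification.

```python
from dataclasses import dataclass, field
from itertools import count

# Opaque local counter: the identifier is not derived from any data in the record.
_next_pid = count(1)

@dataclass
class NamedEntity:
    """A group 2 entity: every name form and every work hangs off the PID."""
    pid: str = field(default_factory=lambda: f"{next(_next_pid):06d}")
    names: set = field(default_factory=set)
    works: list = field(default_factory=list)

def works_for_name(entities, name):
    """Any form of name leads to all works of each entity that uses it."""
    return {e.pid: list(e.works) for e in entities if name in e.names}

stone = NamedEntity(
    names={"Susan Stone", "S. G. Stone"},
    works=["Paper A", "Paper B"],
)

result = works_for_name([stone], "S. G. Stone")
print(result)
```

No name form is privileged here: searching on either variant returns the same works, keyed by the PID rather than by an authorised heading.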
However, in a situation where there is a need to identify a group 2 entity beyond the local system, as is the case with the NicNames Project, a higher level PID is required. This is because we are now trying to link namedEntityA@Swin with namedEntityA@UNSW with namedEntityA@UNew. That is, an Australian researcher may have works deposited at any of a number of Australian research repositories and we want to be able to identify both the works and any authority data not held locally.
This is where an educational or national name identification service, such as the National Library of Australia's People Australia service, could play an important role.
If the first repository to generate authority data for a researcher submits it to People Australia, a PID could be assigned for that researcher which other repositories could then use when incorporating the authority data into their own systems. If works (group 1 entities) were also linked to the authority data, then, in principle, it should be possible to easily find all works by that researcher, in whatever repository they happen to reside.
The implications of this logic are that each repository creates authority data for new researchers as they deposit work into the repository. That authority data, including any attached works and any relevant entity attributes, is submitted to People Australia, which assigns a PID that is later added to the local record.
When the researcher changes institution and deposits material in that institution’s repository, the authority data is retrieved from People Australia and incorporated into the local system complete with the already assigned PID. The new work and any further attributes, such as the new affiliation, are then added to the authority data and resubmitted to People Australia.
It should then be possible, in principle, to incorporate a metasearching component into repository searches which will query People Australia to retrieve all works by a given researcher.
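The submit, retrieve and resubmit cycle described above can be sketched with an in-memory stand-in for the national service. Everything here is hypothetical: the class, its methods and the identifier format are invented for illustration and do not reflect the actual People Australia interface.

```python
from itertools import count

class NameRegistry:
    """In-memory stand-in for a national name service (hypothetical, not a real API)."""

    def __init__(self):
        self._ids = count(1)
        self._records = {}

    def submit(self, authority, pid=None):
        """Register new authority data, or merge an update into an existing PID."""
        if pid is None:
            pid = f"pid-{next(self._ids)}"
            self._records[pid] = {"names": set(), "affiliations": [], "works": []}
        record = self._records[pid]
        record["names"].update(authority.get("names", []))
        record["affiliations"].extend(authority.get("affiliations", []))
        record["works"].extend(authority.get("works", []))
        return pid

    def retrieve(self, pid):
        """Return the authority data held for a PID."""
        return self._records[pid]

registry = NameRegistry()

# First repository creates the authority data and receives a PID.
pid = registry.submit({
    "names": ["Susan Stone"],
    "affiliations": ["Swinburne"],
    "works": [("Swinburne Research Bank", "Paper A")],
})

# The researcher moves; the second repository reuses the PID and resubmits.
registry.submit({
    "names": ["S. G. Stone"],
    "affiliations": ["UNSW"],
    "works": [("UNSW repository", "Paper B")],
}, pid=pid)

# A metasearch by PID now finds works in both repositories.
all_works = registry.retrieve(pid)["works"]
print(all_works)
```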
References
1. Norrish, Jamie (2007). EATS: an entity authority tool set. http://researcharchive.vuw.ac.nz/handle/10063/220
2. Nicholas, Nick, Ward, Nigel and Blinco, Kerry (2009). A policy checklist for enabling persistence of identifiers. D-Lib Magazine. 15 (1/2). http://www.dlib.org/dlib/january09/nicholas/01nicholas.html
09 January 2009
Progress Report January 2009
Most of the team is back at work this week after a break over the Christmas and New Year period and is pressing on with the project. Our Business Analyst, Damien Ingle, started just before Christmas and has spent some time with both Swinburne and Newcastle staff gathering information to feed into stakeholder requirements and institutional analyses.
Rebecca Parker, our Subject Matter Expert (because she actually works with the Swinburne Research Bank repository) is putting together a set of researcher personal name use cases and our Programmer, Tom Rutter, has begun development of tools to work with personal names.
Now that we have a better idea of costs, it seems as though there may be enough funding to take the project beyond the original March deadline or to increase the resources devoted to it over a shorter time frame. So, while we are still not being overly ambitious, it may be that we can do a little more than we had originally thought. Fingers crossed.