Monday, May 17, 2010

CRIS Event Cafe Society Write Up - Group 4: Data Quality

At the JISC/ARMA Repositories and CRIS event 'Learning How to Play Nicely', held at the Rose Bowl, Leeds Met University, on Friday 7th May, the afternoon was dedicated to a cafe society discussion session. Delegates explored four topics, and over the course of four blog posts we are disseminating the facilitator reports from each session.

Please use the comment option below to contribute or comment on these discussion topics.

Group 4 - Data Quality
Facilitator: Simon Kerridge, ARMA

The issue to be discussed was Data Quality, framed as “How do we ensure data quality in our systems? What are the best methods for getting data out of legacy systems?” However, a number of related issues also cropped up in the discussions.

The time was split into four 30-minute slots, with delegates attending as many times as they liked. Some issues were raised on many occasions and others less often; most are presented below.

Unique Identifiers (for many, perhaps all, data items) were considered to be a big issue. Examples included:
• PersonId: institutions rarely use a single id; the various systems (eg HR, CRIS, IT, PGR and others) generally use different ids. Moreover the HR system, which might seem like the obvious primary source, might have multiple entries for the same person (if they hold more than one contract) and, worse, usually only has entries for paid staff – there are many examples of unpaid people involved in research.
• FunderId: many expressed problems with de-duplicating similar-looking funders. It was thought that the funders themselves could/should provide a unique reference (a sketch of one interim normalisation approach follows below).
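
In the absence of an agreed funder authority file, one stopgap is to normalise funder names to a rough comparison key so that obvious variants collapse together. The sketch below is only a minimal illustration in Python; the abbreviation table and example names are invented, and a real table would need curating.

    import re

    # Illustrative abbreviation expansions only; a real table would
    # need curating against an agreed authority file (none exists yet).
    EXPANSIONS = {
        "epsrc": "engineering and physical sciences research council",
        "bbsrc": "biotechnology and biological sciences research council",
    }

    def funder_key(name: str) -> str:
        """Reduce a funder name to a rough comparison key:
        lowercase, strip punctuation, collapse whitespace,
        expand known abbreviations."""
        key = re.sub(r"[^\w\s]", "", name.lower())
        key = " ".join(key.split())
        return EXPANSIONS.get(key, key)

    # Variants of the same funder collapse to a single key.
    names = ["E.P.S.R.C.", "EPSRC",
             "Engineering and Physical Sciences Research Council"]
    assert len({funder_key(n) for n in names}) == 1

This only catches exact variants after normalisation; fuzzier matches need something like the edit-distance check sketched later in this post.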

Authority Lists
• Even if an institution could de-duplicate all its own data and use a single id internally, other institutions would likely not use the same scheme, so exchange of data would be problematic. This could be resolved by an agreed independent authority (for example the HESA staff identifier), but no such authority exists for, say, funders. One was thought to be something that would be extremely useful (a toy crosswalk example follows this list).
• A national policy on national data (eg FunderId) was seen as desirable.
• Scopus / WoS / PubMed were seen as possible partial authority lists for publications (and authors), but they contain differing information and do not cover the whole spectrum – indeed they are not worth using in some subject areas.
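
Where an agreed authority does exist, data exchange reduces to translating local ids to the shared id before export. The sketch below is purely illustrative Python with invented ids; it assumes a HESA-style staff identifier as the authority.

    # Hypothetical crosswalk from local system ids to a shared
    # authority id (here a HESA-style staff id); all ids invented.
    LOCAL_TO_AUTHORITY = {
        "hr:000123":  "HESA:8765432",
        "cris:jb42":  "HESA:8765432",  # same person, different local id
        "pgr:s99871": "HESA:1234567",
    }

    def to_authority_id(local_id: str) -> str:
        """Translate a local id to the agreed authority id before
        exchanging records with another institution."""
        return LOCAL_TO_AUTHORITY[local_id]

    print(to_authority_id("cris:jb42"))  # -> HESA:8765432

Maintaining such a crosswalk is itself a data quality task, which is part of why a single agreed identifier at source would be so useful.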

Data Quality
• Many places have a feedback loop (eg showing academic staff each month what has been added to their profile).
• Use carrots and sticks, eg only allow publications from the IR/CRIS to be used in internal promotion cases or in the annual report.
• One stick method that was generally liked was the Norwegian system, where a prerequisite for receiving public funding for a research project was that all of the author's publications (where possible) had to be submitted to an open access repository.
• Good enough is good enough
• Data should be re-used where possible, but only where it is appropriate; sometimes systems can be developed organically to meet too many requirements and end up not doing any of them well
• Try to think about potential future use of data and collect what you might need – but don’t go overboard. For example, one institution additionally classifies all publications using the Library of Congress system, but so far has not used that metadata.
• Have processes in place to check data quality on input and as a secondary check to ‘approve’ the data – one institution has a ‘checked by Carol’ flag!
• In general self-archiving was not favoured, due to the lack of quality and copyright checking.
• There is some good software available for data quality checking against publications (using Scopus / WoS / PubMed data) and for data aggregation
• One institution uses Levenshtein string comparison (edit distance) to help identify possible duplicate entries – see the sketch after this list.
• The RAE / REF was seen as a good driver for improving both data coverage and data quality.
• Periodic data maintenance and cleansing is essential, but often not undertaken – data quality is unsexy!
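
To make the edit-distance point concrete, here is a minimal, self-contained Python sketch; the threshold and example titles are invented for illustration, and this is not the institution's actual implementation.

    def levenshtein(a: str, b: str) -> int:
        """Classic dynamic-programming edit distance: the minimum
        number of single-character insertions, deletions and
        substitutions needed to turn a into b."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                # deletion
                                curr[j - 1] + 1,            # insertion
                                prev[j - 1] + (ca != cb)))  # substitution
            prev = curr
        return prev[-1]

    def possible_duplicates(titles, max_distance=3):
        """Yield pairs of titles within max_distance edits of each
        other – candidates for manual de-duplication, not automatic
        merging."""
        for i, t1 in enumerate(titles):
            for t2 in titles[i + 1:]:
                if levenshtein(t1.lower(), t2.lower()) <= max_distance:
                    yield t1, t2

    titles = ["Learning How to Play Nicely",
              "Learning how to play nicely.",
              "CRIS and Repositories"]
    for pair in possible_duplicates(titles):
        print(pair)  # flags the first two titles as likely duplicates

Note that the flagged pairs still go to a human checker; edit distance only narrows the search.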

Data Sharing
• Authority lists would make this much easier – surely some work can be done in this area?
• Two institutions recounted the data fusion issues of making a joint submission to the RAE. This simplified a later choice of IR: the second institution simply plumped for the same one as the first.

Parallel Systems
• Many reported using parallel systems within their institutions as the data in the (normally) central system was simply not trusted by all the users.

Priority
• It was universally agreed that problems tended to occur where an issue was not given a high enough priority by the institution. For example, if a DVC took an interest in the quality of data in the IR then resources were made available to improve the processes and data quality.

Legacy Systems
• Often resources were made available for moving data from a legacy system to a new one
• However the migration was often seen as solving data quality problems once and for all, whereas in reality data quality is an ongoing issue, though often not resourced as such.

Primary Data Source
• It was agreed that there is no one system for all an institution's data needs. Indeed that might not be desirable, as individual systems tend to meet different requirements.
• However it should be known where the primary data resides, understanding that for a single record (eg information about staff) this might not all be in one system (a sketch of one way to record this follows below).
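
One lightweight way to make "where the primary data resides" explicit is a field-level map naming the system of record for each attribute, which consuming systems consult instead of guessing. The sketch below is hypothetical Python; the system and field names are invented for illustration.

    # Hypothetical field-level "system of record" map for a staff
    # record; system and field names are invented for illustration.
    SYSTEM_OF_RECORD = {
        "name":         "HR",
        "contract":     "HR",
        "email":        "IT",
        "publications": "CRIS",
        "phd_students": "PGR",
    }

    def primary_source(field: str) -> str:
        """Report which system is authoritative for a given field,
        so downstream systems copy from one agreed place."""
        if field not in SYSTEM_OF_RECORD:
            raise KeyError(f"no agreed primary source for {field!r}")
        return SYSTEM_OF_RECORD[field]

    print(primary_source("publications"))  # -> CRIS

The point is not the code but the agreement it encodes: a single record can legitimately span systems, as long as each field has one acknowledged home.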

Summary (the facilitators view of the discussions)
Overall the discussions were very open and positive. Many participants took away ideas for use in their own institutions. Most were also sure that they would not find it easy to get the resources required to do a proper job of improving their data quality. Some systems were reportedly working very well, others were not. In general the former were the result of new developments, whereas the latter tended to be systems that had been in use for a while. Hopefully this is the result of better new technology being used to support processes; however it seems more likely that systems are neglected once they are seen as embedded and working.
