Using OpenRefine

A data set describing the British Library’s comic book holdings has a number of issues that make it difficult to use for answering specific questions. In this post, I will use OpenRefine to clean and sort the data so that it can answer one such question: “How many of the British Library’s comic books have a questioned place of publication listed?”

As noted above, this data has numerous issues, including:

  • Authors: some authors’ names are given last, first; some have a period after the name; some have a comma after the name; sometimes the same name is entered in different formats
  • Place of Publication: some entries give only a city name, others a city and state, others just a country; state names are abbreviated in various ways; some publication places are bracketed; some have multiple entries
  • Publisher: the same publisher is not always entered the same way, e.g. Titan, Titan Books, Titan [distributor], etc. all refer to the same publisher
  • Date of Publication: some dates end with a period, some do not, some are bracketed, some are circa dates
  • Missing data: some entries do not have information for all the fields

The basic problem is the inconsistent way information has been entered in the spreadsheet.
Not all of these problems need to be resolved to answer the question regarding comics whose place of publication is uncertain, but some of the inconsistencies are best eliminated.

Many of the place of publication entries are bracketed, and if we want to look deeper into the question, we probably want “London?” and “[London?]” counted together.
To do this, I used the transform tool and told it to remove the brackets in this column. The preview shows “[London?]” being transformed to just “London?”
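
For readers who want to see the same clean-up outside OpenRefine, here is a minimal Python sketch of the bracket removal. The function name and the sample values are made up for illustration and are not taken from the data set. Inside OpenRefine, a GREL expression along the lines of value.replace("[", "").replace("]", "") should have the same effect, though that is an assumption about one way to write the transform rather than a record of the exact expression used.

```python
def strip_brackets(place: str) -> str:
    """Remove square brackets so "[London?]" and "London?" compare equal."""
    return place.replace("[", "").replace("]", "").strip()


# Hypothetical sample values, not real rows from the spreadsheet.
samples = ["[London?]", "London?", "[New York]"]
cleaned = [strip_brackets(p) for p in samples]
print(cleaned)
```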

Our question can be answered by using the text filter on the Place of Publication column and entering a question mark in the search field, since a glance through the data suggests this is how uncertainty is denoted. The text filter limits the view to only those comics that have a question mark in their place of publication. However, looking further into the matter using the facet tool, we see that there are 28 distinct values in which a question mark appears. Many of them denote the uncertainty we are interested in; others seem to question the spelling of the place’s name.
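
The filter-then-facet logic can be sketched in Python as follows. This is only an approximation of what OpenRefine does in its interface; the helper names and the sample values are invented for illustration.

```python
from collections import Counter


def uncertain_places(places):
    """Keep only values containing a question mark (the text filter step)."""
    return [p for p in places if "?" in p]


def facet_counts(places):
    """Count how often each distinct value occurs (the facet step)."""
    return Counter(places)


# Hypothetical sample values, not real rows from the spreadsheet.
sample = ["London?", "London", "London?", "Dundee?", "New York"]
facet = facet_counts(uncertain_places(sample))

# List each distinct value with its count, most frequent first.
for value, count in facet.most_common():
    print(value, count)
```

Each distinct value in the facet can then be inspected by hand to decide whether its question mark really signals an uncertain place of publication.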

At this point, it is simple enough to pick out which of the twenty-eight values are relevant and select them for inclusion. This gives us 233 records for comics whose place of publication is a guess. We can also see other information at a glance, such as London being by far the most common assumed place of publication, with 183 of the 233 comics, or about 79%.
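
As a quick check on that proportion, the arithmetic can be reproduced with the two counts quoted above. The function name is just a label chosen for this sketch.

```python
def share(part: int, total: int) -> float:
    """Return part as a percentage of total, rounded to one decimal place."""
    return round(100 * part / total, 1)


# Counts quoted in the text: 183 London records out of 233 uncertain ones.
london_share = share(183, 233)
print(f"{london_share}%")
```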
