What’s your (citations’) style?

8 minute read.

Dominika Tkaczyk – 2019 October 29

In Citation, Crossref Labs, Machine Learning

Bibliographic references in scientific papers are the end result of a process that typically involves finding the right document to cite, obtaining its metadata, and formatting the metadata using a specific citation style. This end result, however, does not preserve the information about the citation style used to generate it. Can the citation style somehow be guessed from the reference string alone?

TL;DR

  • I built an automatic citation style classifier. It classifies a given bibliographic reference string into one of 17 citation styles or “unknown”.
  • The classifier is based on supervised machine learning. It uses TF-IDF feature representation and a simple Logistic Regression model.
  • For training and testing, I used datasets generated automatically from Crossref metadata.
  • The accuracy of the classifier estimated on the test set is 94.7%.
  • The classifier is open source and can be used as a Python library or REST API.

Introduction

Threadgill-Sowder, J. (1983). Question Placement in Mathematical Word Problems. School Science and Mathematics, 83(2), 107-111

This reference is the end result of a process that typically includes: finding the right document, obtaining its metadata, and formatting the metadata using a specific citation style. Sadly, the intermediate reference forms or the details of this process are not preserved in the end result. In general, just by looking at the reference string we cannot be sure which document it originates from, what its metadata is, or which citation style was used.

The global multi-billion-dollar fashion industry proves beyond doubt that people care about their fashion style. But why should we care about the citation style used to generate a specific reference? This might seem like an insignificant piece of information, but it can be a powerful clue when we try to solve tasks like:

  • Reference parsing, i.e., extracting metadata from the reference string. If the style is known, we also know where to expect metadata fields in the string, and it is typically enough to use simple regular expressions instead of complicated (and slow) machine learning-based parsers.
  • Discipline/topic classification. Citation styles used in documents correlate with their discipline. As a result, knowing the citation style used in the document could provide a useful clue for a discipline classifier.
  • Extracting references from documents. Conforming to a specific style might suggest that the reference string was correctly located within a larger document.

Even though the style is not directly mentioned in the reference string, the string contains useful clues. Some styles will abbreviate the authors’ first names, and others won’t. Some will place the year in parentheses, others separate it with commas. The presence of such fragments in the reference string can be used as the input for the style classifier.

I used these clues to build an automatic style classifier. It takes a single reference string as input and classifies it into one of 17 styles or “unknown”. You can use it as a Python library or via REST API. The source code is also available. If you find this project useful, I would love to hear about it!

And if you are interested in more details about the classifier and how it was built, read on.

Data

The data for the experiments was generated automatically. The training and the test set were generated in the same way but from two different samples. The process was the following:

  • 5,000 documents were randomly chosen from the Crossref collection.
  • Each document was formatted into 17 citation styles. This resulted in 85,000 pairs (reference string, citation style).
  • Very short reference strings were removed. A short reference string typically results from very incomplete metadata of the document.
  • From a number of randomly selected references, I removed fragments like the name of the month. These fragments appear in the automatically generated reference strings because months are sometimes included in Crossref metadata records. However, they rarely appear in real-life reference strings, so removing them made the dataset more realistic.
  • 5,000 strings labelled as “unknown” were also added. These were generated by randomly swapping the words in the “real” reference strings.
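The last step above can be sketched in a few lines of Python (a simplified illustration; the exact shuffling procedure used to build the dataset is my assumption):

```python
import random

def make_unknown(reference, rng=random):
    """Turn a real reference string into an 'unknown'-style training
    example by randomly swapping its words."""
    words = reference.split()
    rng.shuffle(words)
    return " ".join(words)
```

The shuffled string keeps the vocabulary of a real reference but destroys the ordering and punctuation patterns that characterize any genuine citation style.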

This process resulted in two sets: a training set of 87,808 data points and a test set of 87,625 data points. The training set was used to choose various classification parameters and to train the final model. The test set was used to obtain the final estimate of the classifier’s accuracy.

Styles

The classifier was trained on the following 17 citation styles (+ “unknown”):

  • acm-sig-proceedings
  • american-chemical-society
  • american-chemical-society-with-titles
  • american-institute-of-physics
  • american-sociological-association
  • apa
  • bmc-bioinformatics
  • chicago-author-date
  • elsevier-without-titles
  • elsevier-with-titles
  • harvard3
  • ieee
  • iso690-author-date-en
  • modern-language-association
  • springer-basic-author-date
  • springer-lecture-notes-in-computer-science
  • vancouver

These 17 styles were chosen to cover a vast majority of references that we see in the real-life data, without including too many variants of very similar styles.

If you need a different style set, fear not. You can use the library to train your own model based on exactly the styles you need.

Features

Our learning algorithm cannot work directly with raw text as input. It needs numerical features. In the case of text classification (and reference strings are text), one very common feature representation is bag-of-words. In the simplest variant, each feature represents a single word, and the value of the feature is binary: 1 if the word is present in the text, 0 otherwise.

There are many variants of this representation, for example:

  • The input text typically undergoes normalization before the features are extracted. Depending on the use case, this might include lowercasing, removing punctuation, bringing the words to their canonical form by stemming, etc.
  • We do not have to use single words as features. In some use cases, it is beneficial to use n-grams, which correspond to fixed-length sequences of words.
  • Instead of binary values, we might want to use some other feature weight schemes, such as the famous TF-IDF representation.

Our use case is not a typical case of text classification. We cannot use raw words as features, as words do not carry the information about the citation style. Imagine the same document formatted in different styles: the resulting reference strings will contain the same words, and the learning algorithm won’t be able to distinguish between them.

As a side note, in some cases, some specific words might be important. For example, if the reference contains the word “algorithm”, chances are the document is from computer science. If so, then perhaps the citing paper is from computer science as well. And in computer science, some styles are more popular than others. Machine learning algorithms are pretty good at detecting such correlations in the data. In the first version of our classifier, however, we do not take this into account. This keeps things simpler.

If not words, then what matters in our case? It seems that the information about the style is present in punctuation, capitalization and abbreviations.

To capture these clues, before extracting the features we first map our reference string into a sequence of “word types” (or “character types”). The types are the following: lowercase-word, lowercase-letter, uppercase-word, uppercase-letter, capitalized-word, other-word, year, number, dot, comma, left-parenthesis, right-parenthesis, left-bracket, right-bracket, colon, semicolon, slash, dash, quote, other.

In addition, we mark the beginning and the end of the reference string with special types start and end.

So for example this string:

Eberlein, T. J. Yearbook of Surgery 2006, 322–324.

is mapped into this sequence:

start capitalized-word comma uppercase-letter dot uppercase-letter dot capitalized-word lowercase-word capitalized-word year comma number dash number dot end

This transformation effectively brings together different words, as long as their form is the same.
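The mapping can be sketched as follows (a simplified illustration; the exact token patterns and the year heuristic are my assumptions, not the classifier’s precise rules):

```python
import re

# Tokenize into letter runs, digit runs, or single other characters.
TOKEN_RE = re.compile(r"[A-Za-z]+|\d+|\S")

PUNCT_TYPES = {
    ".": "dot", ",": "comma", "(": "left-parenthesis", ")": "right-parenthesis",
    "[": "left-bracket", "]": "right-bracket", ":": "colon", ";": "semicolon",
    "/": "slash", "-": "dash", "\u2013": "dash", "\u2014": "dash",
    '"': "quote", "'": "quote", "\u201c": "quote", "\u201d": "quote",
}

def token_type(tok):
    """Map a single token to its 'word type'."""
    if tok.isalpha():
        if tok.islower():
            return "lowercase-word" if len(tok) > 1 else "lowercase-letter"
        if tok.isupper():
            return "uppercase-word" if len(tok) > 1 else "uppercase-letter"
        if tok[0].isupper() and tok[1:].islower():
            return "capitalized-word"
        return "other-word"
    if tok.isdigit():
        # Treat plausible four-digit numbers as years (an assumed heuristic).
        return "year" if len(tok) == 4 and 1500 <= int(tok) <= 2099 else "number"
    return PUNCT_TYPES.get(tok, "other")

def to_type_sequence(reference):
    """Map a reference string to its sequence of word types."""
    return ["start"] + [token_type(t) for t in TOKEN_RE.findall(reference)] + ["end"]
```

Running `to_type_sequence` on the example string above reproduces the sequence shown.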

After transforming the reference string we extract 2-grams, 3-grams and 4-grams. The values of the features are TF-IDF weights.

Some example features in our representation include:

  • lowercase-word lowercase-word lowercase-word lowercase-word - a sequence of four lowercase words. It is most likely part of the article title and won’t have a huge impact on the decision about the citation style.
  • capitalized-word comma uppercase-letter dot - typical representation of an author in some styles, where the first name is given as an initial only and follows the last name.
  • left-parenthesis year right-parenthesis - typical for styles that enclose the year in parentheses.
  • number dash number - this sequence is most likely a page range.
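The n-gram extraction and weighting can be sketched with the standard library alone (the real classifier presumably uses a library vectorizer; the raw-count TF and unsmoothed log(N/df) IDF here are assumptions about the weighting scheme):

```python
import math
from collections import Counter

def ngrams(types, n_values=(2, 3, 4)):
    """All 2-, 3- and 4-grams of a word-type sequence, as tuples."""
    return [tuple(types[i:i + n])
            for n in n_values
            for i in range(len(types) - n + 1)]

def tfidf(documents):
    """TF-IDF weights (raw-count TF times log(N/df) IDF) for each
    document's n-gram features."""
    counts = [Counter(ngrams(doc)) for doc in documents]
    df = Counter()
    for c in counts:
        df.update(c.keys())
    n = len(documents)
    return [{g: tf * math.log(n / df[g]) for g, tf in c.items()}
            for c in counts]
```

Note that an n-gram occurring in every document gets an IDF of zero, so style-independent patterns contribute nothing to the decision.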

Learning algorithm

I tested four learning algorithms (naive Bayes, logistic regression, linear support vector classification and random forest) in a 5-fold cross validation on the training set. The plot shows the distribution of accuracies obtained by each algorithm:

[Plot: distributions of 5-fold cross-validation accuracies for the four algorithms]

Based on these results, logistic regression was chosen as the algorithm with the best mean accuracy and the lowest variance of the results.
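A comparison along these lines can be set up with scikit-learn (a sketch under assumptions: the hyperparameters are illustrative defaults, and `X`/`y` stand for the TF-IDF feature matrix and style labels, which are not built here):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# The four candidate algorithms mentioned above.
CANDIDATES = {
    "naive Bayes": MultinomialNB(),
    "logistic regression": LogisticRegression(max_iter=1000),
    "linear SVC": LinearSVC(),
    "random forest": RandomForestClassifier(n_estimators=100),
}

def compare_models(X, y):
    """5-fold cross-validated accuracy for each candidate algorithm."""
    return {name: cross_val_score(model, X, y, cv=5, scoring="accuracy")
            for name, model in CANDIDATES.items()}
```

Each entry in the returned dictionary holds five accuracy values, one per fold, whose mean and spread can then be compared across algorithms.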

Final accuracy estimation

The final model was trained on the entire training set and evaluated on the test set. Accuracy was used as the evaluation metric. In this case, accuracy is simply the fraction of the references in the test set correctly classified by the classifier.
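The metric itself is one line of code (with hypothetical style labels in the example):

```python
def accuracy(predicted, actual):
    """Fraction of test references whose predicted style matches the label."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)
```

For example, `accuracy(["apa", "ieee", "apa"], ["apa", "ieee", "vancouver"])` yields two correct predictions out of three.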

The accuracy on the test set was 94.7%. The confusion matrix shows which styles were most often confused with each other:

[Plot: confusion matrix of the 17 styles plus “unknown” on the test set]

The most often confused styles are chicago-author-date and american-sociological-association. Let’s see some example strings from these two styles:

Legros, F. 2003. "Can Dispersive Pressure Cause Inverse Grading in Grain Flows?: Reply." Journal of Sedimentary Research 73(2):335–335
Legros, F. 2003. "Can Dispersive Pressure Cause Inverse Grading in Grain Flows?: Reply." Journal of Sedimentary Research 73 (2) : 335–335

Clarke, Jennie T. 2011. "Recognizing and Managing Reticular Erythematous Mucinosis." Archives of Dermatology 147(6):715
Clarke, Jennie T. 2011. "Recognizing and Managing Reticular Erythematous Mucinosis." Archives of Dermatology 147 (6) : 715

Chalmers, Alan, and Richard Nicholas. 1983. "Galileo on the Dissipative Effect of a Rotating Earth." Studies in History and Philosophy of Science Part A 14(4):315–40
Chalmers, Alan, and Richard Nicholas. 1983. "Galileo on the Dissipative Effect of a Rotating Earth." Studies in History and Philosophy of Science Part A 14 (4) : 315–340

It seems that the styles are indeed very similar. The strings look almost identical, apart from spacing, which is not included in any way in our feature representation. No wonder that the classifier confuses these two styles a lot.

A more detailed analysis of the classifier can be found here.

Further reading

  • Jul 8, 2019 – What if I told you that bibliographic references can be structured?
  • Dec 18, 2018 – Reference matching: for real this time
  • Nov 12, 2018 – Matchmaker, matchmaker, make me a match
  • Nov 9, 2018 – What does the sample say?
  • Mar 3, 2015 – Real-time Stream of DOIs being cited in Wikipedia
  • Mar 2, 2015 – Crossref’s DOI Event Tracker Pilot
  • Jan 12, 2015 – Introducing the Crossref Labs DOI Chronograph
  • Aug 7, 2014 – Citation needed