Code and Named Entity Recognizer for StackOverflow
Jeniya Tabassum, Mounica Maddela, Wei Xu, Alan Ritter
Proceedings of ACL 2020




We have introduced a new named entity recognition (NER) corpus for the computer programming domain, consisting of 15,372 sentences annotated with 20 fine-grained entity types. We trained in-domain BERT representations (BERTOverflow) on 152 million sentences from StackOverflow, which led to an absolute increase of +10 F1 score over off-the-shelf BERT. Our proposed SoftNER model achieves an overall 79.10 F1 score for code and named entity recognition on StackOverflow by incorporating a context-independent code token classifier with corpus-level features, outperforming the BERT-based tagging model.

An example sentence, as seen in Figure 1 of our paper.

15,000+ StackOverflow Sentences

We present a fine-grained NER corpus of 15,372 StackOverflow sentences annotated with software entities. These sentences were collected from 1,237 randomly selected question-answer threads in the 10-year StackOverflow archive. For each question, four answers were annotated: the accepted answer, the most upvoted answer, and two randomly selected answers (if they exist).
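The answer-selection procedure described above can be sketched as follows. This is an illustrative reconstruction, not the paper's released collection script; the answer dictionary fields (`id`, `score`) are assumed for the example.

```python
import random

def select_answers(answers, accepted_id, rng=random.Random(0)):
    """Pick up to four answers from a thread for annotation:
    the accepted answer, the most upvoted answer, and two
    randomly selected remaining answers (if they exist)."""
    by_id = {a["id"]: a for a in answers}
    chosen = []
    if accepted_id in by_id:
        chosen.append(by_id[accepted_id])
    # Most upvoted answer (it may coincide with the accepted one).
    top = max(answers, key=lambda a: a["score"], default=None)
    if top is not None and top not in chosen:
        chosen.append(top)
    # Two random answers from whatever is left.
    remaining = [a for a in answers if a not in chosen]
    chosen.extend(rng.sample(remaining, min(2, len(remaining))))
    return chosen
```

For threads with fewer than four answers, the function simply returns as many as are available.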

Our full annotated corpus is available here. This NER corpus contains both code-related and natural language entities. The 8 code entities include mentions of CLASS, VARIABLE, IN LINE CODE, FUNCTION, LIBRARY, VALUE, DATA TYPE, and HTML XML TAG, while the 12 natural language entities include mentions of APPLICATION, UI ELEMENT, LANGUAGE, DATA STRUCTURE, ALGORITHM, FILE TYPE, FILE NAME, VERSION, DEVICE, OS, WEBSITE, and USER NAME.
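The 20-type label scheme above can be written out as label sets for use when loading the corpus. The underscore-joined tag strings here are an assumption for illustration; the exact tag spellings in the released files may differ.

```python
# Hypothetical label sets mirroring the 20 entity types described above.
CODE_ENTITIES = {
    "CLASS", "VARIABLE", "IN_LINE_CODE", "FUNCTION",
    "LIBRARY", "VALUE", "DATA_TYPE", "HTML_XML_TAG",
}
NL_ENTITIES = {
    "APPLICATION", "UI_ELEMENT", "LANGUAGE", "DATA_STRUCTURE",
    "ALGORITHM", "FILE_TYPE", "FILE_NAME", "VERSION",
    "DEVICE", "OS", "WEBSITE", "USER_NAME",
}
ALL_ENTITIES = CODE_ENTITIES | NL_ENTITIES
```

Keeping the code and natural language subsets separate makes it easy to evaluate the two entity groups independently, as the paper's code/NER split suggests.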

An annotated sentence from our corpus, visualized with the BRAT tool.

The dataset was annotated using the BRAT tool by four undergraduate students majoring in computer science. The annotation process was followed by an adjudication step, which resolved inter-annotator disagreements on the 40% of the data that was doubly annotated and ensured the quality of the corpus by cross-checking the 60% that was singly annotated.
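For the doubly annotated portion, token-level agreement between two annotators can be quantified with a statistic such as Cohen's kappa. This is an illustrative sketch, not necessarily the agreement measure reported in the paper.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Token-level Cohen's kappa between two annotators' aligned
    label sequences: observed agreement corrected for the agreement
    expected by chance from each annotator's label distribution."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[l] * counts_b[l] for l in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)
```

A kappa of 1 means perfect agreement; values near 0 mean agreement is no better than chance.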