Fine-Grained Activity Recognition for Assembly Videos

Jonathan D. Jones, Cathryn Cortesa, Amy Shelton, Barbara Landau, Sanjeev Khudanpur, Gregory D. Hager

Research output: Contribution to journal › Article › peer-review


In this letter, we address the task of recognizing assembly actions as a structure (e.g., a piece of furniture or a toy block tower) is built up from a set of primitive objects. Recognizing the full range of assembly actions requires perception at a level of spatial detail that has not been attempted in the action recognition literature to date. We extend the fine-grained activity recognition setting to address the task of assembly action recognition in its full generality by unifying assembly actions and kinematic structures within a single framework. We use this framework to develop a general method for recognizing assembly actions from observation sequences, along with observation features that take advantage of a spatial assembly's special structure. Finally, we evaluate our method empirically on two application-driven data sources: (1) an IKEA furniture-assembly dataset and (2) a block-building dataset. On the first, our system recognizes assembly actions with an average framewise accuracy of 70% and an average normalized edit distance of 10%. On the second, which requires fine-grained geometric reasoning to distinguish between assemblies, our system attains an average normalized edit distance of 23%, a relative improvement of 69% over prior work.
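The two metrics quoted in the abstract can be illustrated with a minimal Python sketch. The abstract does not give their exact definitions, so this assumes common conventions: framewise accuracy is the fraction of frames labeled correctly, and normalized edit distance is the Levenshtein distance between the segment-level label sequences divided by the length of the longer one. The function names (`framewise_accuracy`, `collapse`, `levenshtein`, `normalized_edit_distance`) are hypothetical and not taken from the paper.

```python
from itertools import groupby
from typing import Hashable, List, Sequence


def framewise_accuracy(pred: Sequence[Hashable], true: Sequence[Hashable]) -> float:
    """Fraction of frames whose predicted label equals the reference label."""
    if len(pred) != len(true):
        raise ValueError("expected one label per frame in both sequences")
    if len(true) == 0:
        raise ValueError("sequences must be non-empty")
    return sum(p == t for p, t in zip(pred, true)) / len(true)


def collapse(labels: Sequence[Hashable]) -> List[Hashable]:
    """Merge runs of identical consecutive frame labels into segment labels."""
    return [key for key, _ in groupby(labels)]


def levenshtein(a: Sequence[Hashable], b: Sequence[Hashable]) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute (free if equal)
        prev = curr
    return prev[-1]


def normalized_edit_distance(pred: Sequence[Hashable], true: Sequence[Hashable]) -> float:
    """Assumed normalization: divide by the longer segment sequence's length."""
    p, t = collapse(pred), collapse(true)
    longest = max(len(p), len(t))
    return levenshtein(p, t) / longest if longest else 0.0


# Toy example with made-up labels: the prediction misses the middle segment "b".
pred = ["a", "a", "c", "c"]
true = ["a", "b", "c", "c"]
accuracy = framewise_accuracy(pred, true)
distance = normalized_edit_distance(pred, true)
```

Under these assumed definitions, lower edit distance is better, while higher framewise accuracy is better. The edit distance ignores segment timing and penalizes only errors in the order and identity of recognized actions.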

Original language: English (US)
Article number: 9372803
Pages (from-to): 3728-3735
Number of pages: 8
Journal: IEEE Robotics and Automation Letters
Issue number: 2
State: Published - Apr 2021


Keywords

  • Probabilistic Inference
  • assembly
  • multi-modal perception for HRI
  • recognition
  • sensor fusion

ASJC Scopus subject areas

  • Control and Systems Engineering
  • Biomedical Engineering
  • Human-Computer Interaction
  • Mechanical Engineering
  • Computer Vision and Pattern Recognition
  • Computer Science Applications
  • Control and Optimization
  • Artificial Intelligence

