This paper presents an attribute and-or grammar (A-AOG) model for jointly inferring human body pose and human attributes in a parse graph with attributes augmented to nodes in the hierarchical representation. In contrast to other popular methods in the current literature that train separate classifiers for poses and individual attributes, our method explicitly represents the decomposition and articulation of body parts, and account for the correlations between poses and attributes. The A-AOG model is an amalgamation of three traditional grammar formulations: (i) Phrase structure grammar representing the hierarchical decomposition of the human body from whole to parts; (ii) Dependency grammar modeling the geometric articulation by a kinematic graph of the body pose; and (iii) Attribute grammar accounting for the compatibility relations between different parts in the hierarchy so that their appearances follow a consistent style. The parse graph outputs human detection, pose estimation, and attribute prediction simultaneously, which are intuitive and interpretable. We conduct experiments on two tasks on two datasets, and experimental results demonstrate the advantage of joint modeling in comparison with computing poses and attributes independently. Furthermore, our model obtains better performance over existing methods for both pose estimation and attribute prediction tasks.