In this paper, we propose a novel emotion recognition method that captures affect-salient information using acoustic and lexical features. The acoustic features are obtained by applying statistical functionals to high-level emotional features derived from the speech signal by a Deep Neural Network (DNN). These acoustic features are early-fused with two types of lexical features extracted from the text transcription of the speech signal: a distributed representation and affective lexicon-based dimensions. The fused features are fed to another DNN for utterance-level emotion classification. Experimental results on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) multimodal dataset show an unweighted accuracy recall of 75.5%, which outperforms the best previously reported results in multimodal emotion recognition using acoustic and lexical features.
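
As an illustration of the feature pipeline described above, the following is a minimal sketch of how frame-level outputs might be summarized with statistical functionals and then early-fused with lexical features. The particular functionals (mean, standard deviation, minimum, maximum), the feature dimensions, and the function names `statistical_functionals` and `early_fuse` are assumptions made for this example and are not taken from the paper; random arrays stand in for real DNN outputs and lexical features.

```python
import numpy as np


def statistical_functionals(frame_features: np.ndarray) -> np.ndarray:
    """Collapse a (num_frames, dim) matrix into one utterance-level vector.

    The choice of functionals here is an assumption for illustration.
    """
    return np.concatenate([
        frame_features.mean(axis=0),
        frame_features.std(axis=0),
        frame_features.min(axis=0),
        frame_features.max(axis=0),
    ])


def early_fuse(acoustic: np.ndarray,
               embedding: np.ndarray,
               affect_dims: np.ndarray) -> np.ndarray:
    """Early fusion: concatenate all feature vectors before classification."""
    return np.concatenate([acoustic, embedding, affect_dims])


rng = np.random.default_rng(0)

# Hypothetical frame-level DNN outputs: 120 frames, 4 high-level features each.
frames = rng.random((120, 4))
acoustic = statistical_functionals(frames)   # 4 functionals x 4 features

# Hypothetical lexical features for the utterance transcription.
embedding = rng.random(300)    # distributed representation (assumed size)
affect_dims = rng.random(3)    # affective lexicon-based dimensions (assumed size)

# The fused vector would be the input to the utterance-level classifier.
fused = early_fuse(acoustic, embedding, affect_dims)
print(acoustic.shape, fused.shape)
```

Concatenating before classification lets a single classifier learn interactions between the acoustic and lexical inputs, which is the defining property of early fusion as opposed to combining per-modality decisions afterwards.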