Whisper is an open-source automatic speech recognition (ASR) system trained on 680,000 hours of multilingual, multitask supervised data collected from the web. Training on a dataset this large and varied is intended to make the model robust to accents, background noise, and technical jargon. Whisper can transcribe speech in many languages and translate speech from those languages into English. It takes a simple end-to-end approach built on an encoder-decoder Transformer architecture. Beyond transcription, the same model performs language identification and produces phrase-level timestamps. Its accuracy and ease of use are meant to let developers add voice interfaces to a wide range of applications.
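As a rough illustration of how these capabilities surface in code, here is a minimal sketch that assumes the `openai-whisper` Python package is installed and `ffmpeg` is available on the system path. The helper names (`format_timestamp`, `format_segments`, `transcribe_file`) and the choice of the `"base"` model are illustrative assumptions, not part of Whisper itself.

```python
from typing import Any, Dict, List


def format_timestamp(seconds: float) -> str:
    """Render a time offset in seconds as MM:SS.mmm."""
    total_ms = int(round(seconds * 1000))
    minutes, rem_ms = divmod(total_ms, 60_000)
    secs, ms = divmod(rem_ms, 1000)
    return f"{minutes:02d}:{secs:02d}.{ms:03d}"


def format_segments(segments: List[Dict[str, Any]]) -> List[str]:
    """Turn phrase-level segments into '[start -> end] text' lines."""
    return [
        f"[{format_timestamp(s['start'])} -> {format_timestamp(s['end'])}] "
        f"{s['text'].strip()}"
        for s in segments
    ]


def transcribe_file(
    path: str, model_name: str = "base", translate: bool = False
) -> Dict[str, Any]:
    """Run Whisper on one audio file (hypothetical wrapper).

    Assumes `pip install openai-whisper` and ffmpeg on PATH.
    """
    # Imported lazily so the formatting helpers work without the package.
    import whisper

    model = whisper.load_model(model_name)
    # "transcribe" keeps the spoken language; "translate" outputs English.
    task = "translate" if translate else "transcribe"
    return model.transcribe(path, task=task)
```

Given an audio file, `result = transcribe_file("speech.mp3")` should return a dictionary whose `"language"` entry holds the detected language code and whose `"segments"` entry holds the timestamped phrases; passing that list to `format_segments` gives one readable line per phrase. The file name here is a placeholder.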