基于kaldi的iOS實(shí)時(shí)語(yǔ)音識(shí)別(本地)+05+解碼

iOS在線識(shí)別:http://www.reibang.com/u/3c2a0bd52ebc

前面部分講的跟語(yǔ)音識(shí)別關(guān)系不大炕吸,這部分開(kāi)始講解語(yǔ)音識(shí)別相關(guān)的內(nèi)容,首先接上部分內(nèi)容背亥,來(lái)講解一下語(yǔ)音識(shí)別的解碼秒际,即輸入音頻輸出文本。

先看一下kaldi給出的解碼參考代碼:

int main(int argc, char *argv[]) {
  try {
    using namespace kaldi;
    using namespace fst;

    typedef kaldi::int32 int32;
    typedef kaldi::int64 int64;

    const char *usage =
        "Reads in wav file(s) and simulates online decoding with neural nets\n"
        "(nnet3 setup), with optional iVector-based speaker adaptation and\n"
        "optional endpointing.  Note: some configuration values and inputs are\n"
        "set via config files whose filenames are passed as options\n"
        "\n"
        "Usage: online2-wav-nnet3-latgen-faster [options] <nnet3-in> <fst-in> "
        "<spk2utt-rspecifier> <wav-rspecifier> <lattice-wspecifier>\n"
        "The spk2utt-rspecifier can just be <utterance-id> <utterance-id> if\n"
        "you want to decode utterance by utterance.\n";

    ParseOptions po(usage);

    std::string word_syms_rxfilename;

    // feature_opts includes configuration for the iVector adaptation,
    // as well as the basic features.
    OnlineNnet2FeaturePipelineConfig feature_opts;
    nnet3::NnetSimpleLoopedComputationOptions decodable_opts;
    LatticeFasterDecoderConfig decoder_opts;
    OnlineEndpointConfig endpoint_opts;

    BaseFloat chunk_length_secs = 0.18;
    bool do_endpointing = false;
    bool online = true;

    po.Register("chunk-length", &chunk_length_secs,
                "Length of chunk size in seconds, that we process.  Set to <= 0 "
                "to use all input in one chunk.");
    po.Register("word-symbol-table", &word_syms_rxfilename,
                "Symbol table for words [for debug output]");
    po.Register("do-endpointing", &do_endpointing,
                "If true, apply endpoint detection");
    po.Register("online", &online,
                "You can set this to false to disable online iVector estimation "
                "and have all the data for each utterance used, even at "
                "utterance start.  This is useful where you just want the best "
                "results and don't care about online operation.  Setting this to "
                "false has the same effect as setting "
                "--use-most-recent-ivector=true and --greedy-ivector-extractor=true "
                "in the file given to --ivector-extraction-config, and "
                "--chunk-length=-1.");
    po.Register("num-threads-startup", &g_num_threads,
                "Number of threads used when initializing iVector extractor.");

    feature_opts.Register(&po);
    decodable_opts.Register(&po);
    decoder_opts.Register(&po);
    endpoint_opts.Register(&po);


    po.Read(argc, argv);

    if (po.NumArgs() != 5) {
      po.PrintUsage();
      return 1;
    }

    std::string nnet3_rxfilename = po.GetArg(1),
        fst_rxfilename = po.GetArg(2),
        spk2utt_rspecifier = po.GetArg(3),
        wav_rspecifier = po.GetArg(4),
        clat_wspecifier = po.GetArg(5);

    OnlineNnet2FeaturePipelineInfo feature_info(feature_opts);

    if (!online) {
      feature_info.ivector_extractor_info.use_most_recent_ivector = true;
      feature_info.ivector_extractor_info.greedy_ivector_extractor = true;
      chunk_length_secs = -1.0;
    }

    TransitionModel trans_model;
    nnet3::AmNnetSimple am_nnet;
    {
      bool binary;
      Input ki(nnet3_rxfilename, &binary);
      trans_model.Read(ki.Stream(), binary);
      am_nnet.Read(ki.Stream(), binary);
      SetBatchnormTestMode(true, &(am_nnet.GetNnet()));
      SetDropoutTestMode(true, &(am_nnet.GetNnet()));
      nnet3::CollapseModel(nnet3::CollapseModelConfig(), &(am_nnet.GetNnet()));
    }

    // this object contains precomputed stuff that is used by all decodable
    // objects.  It takes a pointer to am_nnet because if it has iVectors it has
    // to modify the nnet to accept iVectors at intervals.
    nnet3::DecodableNnetSimpleLoopedInfo decodable_info(decodable_opts,
                                                        &am_nnet);


    fst::Fst<fst::StdArc> *decode_fst = ReadFstKaldiGeneric(fst_rxfilename);

    fst::SymbolTable *word_syms = NULL;
    if (word_syms_rxfilename != "")
      if (!(word_syms = fst::SymbolTable::ReadText(word_syms_rxfilename)))
        KALDI_ERR << "Could not read symbol table from file "
                  << word_syms_rxfilename;

    int32 num_done = 0, num_err = 0;
    double tot_like = 0.0;
    int64 num_frames = 0;

    SequentialTokenVectorReader spk2utt_reader(spk2utt_rspecifier);
    RandomAccessTableReader<WaveHolder> wav_reader(wav_rspecifier);
    CompactLatticeWriter clat_writer(clat_wspecifier);

    OnlineTimingStats timing_stats;

    for (; !spk2utt_reader.Done(); spk2utt_reader.Next()) {
      std::string spk = spk2utt_reader.Key();
      const std::vector<std::string> &uttlist = spk2utt_reader.Value();
      OnlineIvectorExtractorAdaptationState adaptation_state(
          feature_info.ivector_extractor_info);
      for (size_t i = 0; i < uttlist.size(); i++) {
        std::string utt = uttlist[i];
        if (!wav_reader.HasKey(utt)) {
          KALDI_WARN << "Did not find audio for utterance " << utt;
          num_err++;
          continue;
        }
        const WaveData &wave_data = wav_reader.Value(utt);
        // get the data for channel zero (if the signal is not mono, we only
        // take the first channel).
        SubVector<BaseFloat> data(wave_data.Data(), 0);

        OnlineNnet2FeaturePipeline feature_pipeline(feature_info);
        feature_pipeline.SetAdaptationState(adaptation_state);

        OnlineSilenceWeighting silence_weighting(
            trans_model,
            feature_info.silence_weighting_config,
            decodable_opts.frame_subsampling_factor);

        SingleUtteranceNnet3Decoder decoder(decoder_opts, trans_model,
                                            decodable_info,
                                            *decode_fst, &feature_pipeline);
        OnlineTimer decoding_timer(utt);

        BaseFloat samp_freq = wave_data.SampFreq();
        int32 chunk_length;
        if (chunk_length_secs > 0) {
          chunk_length = int32(samp_freq * chunk_length_secs);
          if (chunk_length == 0) chunk_length = 1;
        } else {
          chunk_length = std::numeric_limits<int32>::max();
        }

        int32 samp_offset = 0;
        std::vector<std::pair<int32, BaseFloat> > delta_weights;

        while (samp_offset < data.Dim()) {
          int32 samp_remaining = data.Dim() - samp_offset;
          int32 num_samp = chunk_length < samp_remaining ? chunk_length
                                                         : samp_remaining;

          SubVector<BaseFloat> wave_part(data, samp_offset, num_samp);
          feature_pipeline.AcceptWaveform(samp_freq, wave_part);

          samp_offset += num_samp;
          decoding_timer.WaitUntil(samp_offset / samp_freq);
          if (samp_offset == data.Dim()) {
            // no more input. flush out last frames
            feature_pipeline.InputFinished();
          }

          if (silence_weighting.Active() &&
              feature_pipeline.IvectorFeature() != NULL) {
            silence_weighting.ComputeCurrentTraceback(decoder.Decoder());
            silence_weighting.GetDeltaWeights(feature_pipeline.NumFramesReady(),
                                              &delta_weights);
            feature_pipeline.IvectorFeature()->UpdateFrameWeights(delta_weights);
          }

          decoder.AdvanceDecoding();

          if (do_endpointing && decoder.EndpointDetected(endpoint_opts)) {
            break;
          }
        }
        decoder.FinalizeDecoding();

        CompactLattice clat;
        bool end_of_utterance = true;
        decoder.GetLattice(end_of_utterance, &clat);

        GetDiagnosticsAndPrintOutput(utt, word_syms, clat,
                                     &num_frames, &tot_like);

        decoding_timer.OutputStats(&timing_stats);

        // In an application you might avoid updating the adaptation state if
        // you felt the utterance had low confidence.  See lat/confidence.h
        feature_pipeline.GetAdaptationState(&adaptation_state);

        // we want to output the lattice with un-scaled acoustics.
        BaseFloat inv_acoustic_scale =
            1.0 / decodable_opts.acoustic_scale;
        ScaleLattice(AcousticLatticeScale(inv_acoustic_scale), &clat);

        clat_writer.Write(utt, clat);
        KALDI_LOG << "Decoded utterance " << utt;
        num_done++;
      }
    }
    timing_stats.Print(online);

    KALDI_LOG << "Decoded " << num_done << " utterances, "
              << num_err << " with errors.";
    KALDI_LOG << "Overall likelihood per frame was " << (tot_like / num_frames)
              << " per frame over " << num_frames << " frames.";
    delete decode_fst;
    delete word_syms; // will delete if non-NULL.
    return (num_done != 0 ? 0 : 1);
  } catch(const std::exception& e) {
    std::cerr << e.what();
    return -1;
  }
} // main()

我們的解碼:

static void kaldidecoder_decode_segment(Gstkaldidecoder * filter,
                                                        bool &more_data,
                                                        int32 chunk_length,
                                                        BaseFloat traceback_period_secs) {
  OnlineNnet2FeaturePipeline feature_pipeline(*(filter->feature_info));
  feature_pipeline.SetAdaptationState(*(filter->adaptation_state));
  SingleUtteranceNnet3Decoder decoder(*(filter->decoder_opts),
                                      *(filter->trans_model), 
                                      *(filter->decodable_info_nnet3),
                                      *(filter->decode_fst),
                                      &feature_pipeline);
  OnlineSilenceWeighting silence_weighting(*(filter->trans_model),
          *(filter->silence_weighting_config));

  Vector<BaseFloat> wave_part = Vector<BaseFloat>(chunk_length);
  std::vector<std::pair<int32, BaseFloat> > delta_weights;
  DEBUG_OBJECT(filter, "Reading audio in %d sample chunks...",
                   wave_part.Dim());
  BaseFloat last_traceback = 0.0;
  BaseFloat num_seconds_decoded = 0.0;
  while (true) {
    more_data = filter->audio_source->Read(&wave_part);

    feature_pipeline.AcceptWaveform(filter->sample_rate, wave_part);
    if (!more_data) {
      feature_pipeline.InputFinished();
    }

    if (silence_weighting.Active() && 
        feature_pipeline.IvectorFeature() != NULL) {
      silence_weighting.ComputeCurrentTraceback(decoder.Decoder());
      silence_weighting.GetDeltaWeights(feature_pipeline.IvectorFeature()->NumFramesReady(), 
                                        &delta_weights);
      feature_pipeline.IvectorFeature()->UpdateFrameWeights(delta_weights);
    }

    decoder.AdvanceDecoding();
    DEBUG_OBJECT(filter, "%d frames decoded", decoder.NumFramesDecoded());
    num_seconds_decoded += 1.0 * wave_part.Dim() / filter->sample_rate;
    filter->total_time_decoded += 1.0 * wave_part.Dim() / filter->sample_rate;
    DEBUG_OBJECT(filter, "Total amount of audio processed: %f seconds", filter->total_time_decoded);
    if (!more_data) {
      break;
    }
    if (filter->do_endpointing
        && (decoder.NumFramesDecoded() > 0)
        && decoder.EndpointDetected(*(filter->endpoint_config))) {
      GST_DEBUG_OBJECT(filter, "Endpoint detected!");
      break;
    }

    if ((num_seconds_decoded - last_traceback > traceback_period_secs)
        && (decoder.NumFramesDecoded() > 0)) {
      Lattice lat;
      decoder.GetBestPath(false, &lat);
      kaldidecoder_partial_result(filter, lat);
      last_traceback += traceback_period_secs;
    }
  }

  if (num_seconds_decoded > 0.1) {
    DEBUG_OBJECT(filter, "Getting lattice..");
    decoder.FinalizeDecoding();
    CompactLattice clat;
    bool end_of_utterance = true;
    decoder.GetLattice(end_of_utterance, &clat);
    DEBUG_OBJECT(filter, "Lattice done");
    if ((filter->lm_fst != NULL) && (filter->big_lm_const_arpa != NULL)) {
      DEBUG_OBJECT(filter, "Rescoring lattice with a big LM");
      CompactLattice rescored_lat;
      if (kaldidecoder_rescore_big_lm(filter, clat, rescored_lat)) {
        clat = rescored_lat;
      }
    }

    guint num_words = 0;
    kaldidecoder_final_result(filter, clat, &num_words);
    if (num_words >= filter->min_words_for_ivector) {
      // Only update adaptation state if the utterance contained enough words
      feature_pipeline.GetAdaptationState(filter->adaptation_state);
    }
  } else {
    DEBUG_OBJECT(filter, "Less than 0.1 seconds decoded, discarding");
  }
}

后面會(huì)詳細(xì)講解解碼

demo:https://github.com/andyweiqiu/asr-ios-local

最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者
  • 序言:七十年代末狡汉,一起剝皮案震驚了整個(gè)濱河市娄徊,隨后出現(xiàn)的幾起案子,更是在濱河造成了極大的恐慌盾戴,老刑警劉巖寄锐,帶你破解...
    沈念sama閱讀 217,657評(píng)論 6 505
  • 序言:濱河連續(xù)發(fā)生了三起死亡事件,死亡現(xiàn)場(chǎng)離奇詭異尖啡,居然都是意外死亡橄仆,警方通過(guò)查閱死者的電腦和手機(jī),發(fā)現(xiàn)死者居然都...
    沈念sama閱讀 92,889評(píng)論 3 394
  • 文/潘曉璐 我一進(jìn)店門衅斩,熙熙樓的掌柜王于貴愁眉苦臉地迎上來(lái)盆顾,“玉大人,你說(shuō)我怎么就攤上這事畏梆∧埽” “怎么了?”我有些...
    開(kāi)封第一講書(shū)人閱讀 164,057評(píng)論 0 354
  • 文/不壞的土叔 我叫張陵奠涌,是天一觀的道長(zhǎng)蚕涤。 經(jīng)常有香客問(wèn)我,道長(zhǎng)铣猩,這世上最難降的妖魔是什么揖铜? 我笑而不...
    開(kāi)封第一講書(shū)人閱讀 58,509評(píng)論 1 293
  • 正文 為了忘掉前任,我火速辦了婚禮达皿,結(jié)果婚禮上天吓,老公的妹妹穿的比我還像新娘。我一直安慰自己峦椰,他們只是感情好龄寞,可當(dāng)我...
    茶點(diǎn)故事閱讀 67,562評(píng)論 6 392
  • 文/花漫 我一把揭開(kāi)白布。 她就那樣靜靜地躺著汤功,像睡著了一般物邑。 火紅的嫁衣襯著肌膚如雪。 梳的紋絲不亂的頭發(fā)上,一...
    開(kāi)封第一講書(shū)人閱讀 51,443評(píng)論 1 302
  • 那天色解,我揣著相機(jī)與錄音茂嗓,去河邊找鬼。 笑死科阎,一個(gè)胖子當(dāng)著我的面吹牛述吸,可吹牛的內(nèi)容都是我干的。 我是一名探鬼主播锣笨,決...
    沈念sama閱讀 40,251評(píng)論 3 418
  • 文/蒼蘭香墨 我猛地睜開(kāi)眼蝌矛,長(zhǎng)吁一口氣:“原來(lái)是場(chǎng)噩夢(mèng)啊……” “哼!你這毒婦竟也來(lái)了错英?” 一聲冷哼從身側(cè)響起入撒,我...
    開(kāi)封第一講書(shū)人閱讀 39,129評(píng)論 0 276
  • 序言:老撾萬(wàn)榮一對(duì)情侶失蹤,失蹤者是張志新(化名)和其女友劉穎椭岩,沒(méi)想到半個(gè)月后茅逮,有當(dāng)?shù)厝嗽跇?shù)林里發(fā)現(xiàn)了一具尸體,經(jīng)...
    沈念sama閱讀 45,561評(píng)論 1 314
  • 正文 獨(dú)居荒郊野嶺守林人離奇死亡簿煌,尸身上長(zhǎng)有42處帶血的膿包…… 初始之章·張勛 以下內(nèi)容為張勛視角 年9月15日...
    茶點(diǎn)故事閱讀 37,779評(píng)論 3 335
  • 正文 我和宋清朗相戀三年,在試婚紗的時(shí)候發(fā)現(xiàn)自己被綠了鉴吹。 大學(xué)時(shí)的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片姨伟。...
    茶點(diǎn)故事閱讀 39,902評(píng)論 1 348
  • 序言:一個(gè)原本活蹦亂跳的男人離奇死亡,死狀恐怖豆励,靈堂內(nèi)的尸體忽然破棺而出夺荒,到底是詐尸還是另有隱情,我是刑警寧澤良蒸,帶...
    沈念sama閱讀 35,621評(píng)論 5 345
  • 正文 年R本政府宣布技扼,位于F島的核電站,受9級(jí)特大地震影響嫩痰,放射性物質(zhì)發(fā)生泄漏剿吻。R本人自食惡果不足惜,卻給世界環(huán)境...
    茶點(diǎn)故事閱讀 41,220評(píng)論 3 328
  • 文/蒙蒙 一串纺、第九天 我趴在偏房一處隱蔽的房頂上張望丽旅。 院中可真熱鬧,春花似錦纺棺、人聲如沸榄笙。這莊子的主人今日做“春日...
    開(kāi)封第一講書(shū)人閱讀 31,838評(píng)論 0 22
  • 文/蒼蘭香墨 我抬頭看了看天上的太陽(yáng)茅撞。三九已至,卻和暖如春,著一層夾襖步出監(jiān)牢的瞬間米丘,已是汗流浹背剑令。 一陣腳步聲響...
    開(kāi)封第一講書(shū)人閱讀 32,971評(píng)論 1 269
  • 我被黑心中介騙來(lái)泰國(guó)打工, 沒(méi)想到剛下飛機(jī)就差點(diǎn)兒被人妖公主榨干…… 1. 我叫王不留蠕蚜,地道東北人尚洽。 一個(gè)月前我還...
    沈念sama閱讀 48,025評(píng)論 2 370
  • 正文 我出身青樓,卻偏偏與公主長(zhǎng)得像靶累,于是被迫代替她去往敵國(guó)和親腺毫。 傳聞我的和親對(duì)象是個(gè)殘疾皇子,可洞房花燭夜當(dāng)晚...
    茶點(diǎn)故事閱讀 44,843評(píng)論 2 354

推薦閱讀更多精彩內(nèi)容