Natural Language Translator

Client: Global Semiconductor Manufacturer


  • Python
  • C++
  • Assembly


Hardware-specific optimizations for a Natural Language Translator algorithm.


Achieve a 100x (100 times) performance improvement for an algorithm that was already optimized, and identify which sections of the algorithm are suitable for optimization.


The team decomposed the algorithm and ran tests and measurements to identify its most time-consuming steps. Following this analysis, the team investigated quantization and dequantization techniques described in public whitepapers to improve performance. Some recently developed features of the target hardware were also needed to obtain the improvement. All of these techniques were new to the team, so we had to ramp up quickly to keep to the project timeline.
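The measurement step described above can be sketched as a small per-step timer. This is a minimal illustration in Python, not the team's actual tooling: the function `profile_steps` and the stand-in stages `tokenize` and `encode` are hypothetical names introduced only for this example.

```python
import time
from collections import defaultdict


def profile_steps(steps, data, repeats=5):
    """Run each (name, function) step in order and total the wall-clock time per step."""
    totals = defaultdict(float)
    for _ in range(repeats):
        value = data
        for name, func in steps:
            start = time.perf_counter()
            value = func(value)
            totals[name] += time.perf_counter() - start
    # Slowest step first, so the best optimization candidates appear at the top.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical stand-ins for real pipeline stages.
def tokenize(text):
    return text.split()


def encode(tokens):
    # Deliberately heavier work, standing in for a costly numeric stage.
    return [sum(ord(c) ** 2 for c in tok * 2000) for tok in tokens]


for name, seconds in profile_steps(
    [("tokenize", tokenize), ("encode", encode)], "ein kurzer Satz"
):
    print(f"{name}: {seconds:.6f} s")
```

Sorting by accumulated time makes it easy to see which sections dominate the runtime before any optimization effort is spent on them.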

The team analyzed both the high-level model topology (Python code) and the low-level architecture layers in order to optimize a neural-network-based system for the inference step, that is, the actual translation execution.

We used C++ and Assembly to optimize a network topology that translates text from one language to another. We optimized the code for the latest Intel Xeon CPUs based on the Skylake architecture. The optimization combined quantization with carefully tuned use of the CPU caches and took advantage of AVX-512 hardware support. This process required a deep understanding of both the hardware architecture and the network topology.
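To make the quantization idea concrete, here is a minimal sketch in plain Python. It assumes symmetric 8-bit quantization with a single scale per vector; the case study does not say which scheme was used, and the production code was written in C++ and Assembly with AVX-512, so this only illustrates the arithmetic. The function names are hypothetical.

```python
def quantize(values, num_bits=8):
    """Map floats to signed integers using one shared scale (symmetric quantization)."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integers and their scale."""
    return [q * scale for q in quantized]


def quantized_dot(a, b):
    """Dot product accumulated in integers, rescaled to a float once at the end."""
    qa, scale_a = quantize(a)
    qb, scale_b = quantize(b)
    acc = sum(x * y for x, y in zip(qa, qb))  # integer multiply-accumulate
    return acc * scale_a * scale_b


weights = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize(weights)
print(q, dequantize(q, scale))
```

The inner loop works on small integers only, which is the property that lets wide vector instructions process many elements at once; the cost is a rounding error of at most half a scale step per value.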

We also started implementing support for the next generation of CPUs, in order to achieve an even better speed/accuracy trade-off.

We created a flexible solution with multiple levels of optimization, each successive level offering higher speed at lower accuracy. Users can select among these levels to set their own speed vs. accuracy trade-off.
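One way a user-facing selection of this kind could work is sketched below. The `OptimizationLevel` type, the `pick_level` function, and all numbers in `LEVELS` are hypothetical placeholders for illustration; they are not the project's interface or measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OptimizationLevel:
    level: int       # higher level = faster, less accurate
    speedup: float   # speed relative to the unoptimized baseline
    accuracy: float  # fraction of the baseline accuracy retained


def pick_level(levels, min_accuracy):
    """Return the fastest level whose accuracy is at least min_accuracy, or None."""
    eligible = [lv for lv in levels if lv.accuracy >= min_accuracy]
    return max(eligible, key=lambda lv: lv.speedup, default=None)


# Placeholder numbers for illustration only; not measurements from the project.
LEVELS = [
    OptimizationLevel(level=0, speedup=1.0, accuracy=1.00),
    OptimizationLevel(level=1, speedup=2.0, accuracy=0.99),
    OptimizationLevel(level=2, speedup=4.0, accuracy=0.95),
]

print(pick_level(LEVELS, min_accuracy=0.98))
```

Expressing the choice as an accuracy floor lets the caller state what quality is acceptable and leaves the speed decision to the library.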


The project was split into several phases to make sure we preserved the same functionality and accuracy. The following phases were rolled out over the life of the project:

  • porting the Python algorithm to C++ while preserving its functionality and the same level of accuracy. The translations from German to English were 100% exact, with the same level of accuracy
  • analyzing the algorithm documentation and the quantization whitepapers
  • decomposing the algorithm into a few clearly identified sections
  • applying quantization techniques to the sections of the algorithm where they were suitable
  • an iterative test-and-improve phase, in which the performance of the algorithm was improved in several steps. This iteration continued until performance improved by no more than 1% between two consecutive steps.
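The stopping rule of the last phase can be sketched as a simple loop. This is an illustrative outline under stated assumptions: `improve_step` is a hypothetical callback that stands for "apply the next optimization and re-measure the runtime", and `max_steps` is a safety limit added for the example.

```python
def iterate_until_converged(baseline_time, improve_step, min_gain=0.01, max_steps=100):
    """Repeat improve_step until the relative gain between two steps is min_gain or less.

    baseline_time must be a positive runtime in seconds.
    """
    history = [baseline_time]
    for _ in range(max_steps):
        new_time = improve_step(history[-1])
        gain = (history[-1] - new_time) / history[-1]
        history.append(new_time)
        if gain <= min_gain:  # improved by no more than 1%: stop iterating
            break
    return history


# Scripted runtimes standing in for real measurements (illustrative values only).
scripted = iter([4.0, 2.0, 1.99, 1.0])
print(iterate_until_converged(8.0, lambda _current: next(scripted)))
```

In this scripted run the loop stops after the step from 2.0 to 1.99 seconds, because that step gains only 0.5%, and the remaining candidate is never tried.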

Machine Learning

The latest computer science innovations are improving the quality of automatic translation services in terms of speed and accuracy: topologies like NMT (neural machine translation) can take advantage of newer and more efficient hardware based on CPU architectures like Intel Skylake.

These technologies express their full potential when the software is fully optimized to match the hardware platform.


The client incorporated the new algorithm in a machine learning library delivered with the new hardware.


  • Translation time reduced from 8 seconds to 0.11 seconds per sentence
  • 17 levels of optimization to choose from
  • Same accuracy
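For reference, the speedup factor implied by the two timings in the results above can be computed directly. The variable names are illustrative; the two input values come from the results list.

```python
before_s = 8.0   # seconds per sentence before optimization
after_s = 0.11   # seconds per sentence after optimization

speedup = before_s / after_s
throughput = 1 / after_s  # sentences per second after optimization

print(f"speedup: {speedup:.1f}x")                       # about 72.7x
print(f"throughput: {throughput:.1f} sentences/second")  # about 9.1
```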
