This is a C++ implementation of word2vec that is optimized for Intel CPUs, particularly Intel Xeon and Intel Xeon Phi (Knights Landing) processors. It supports the "HogBatch" parallel SGD described in "Parallelizing Word2vec in Shared and Distributed Memory"; a short NIPS workshop version is listed in the references below. It also uses data parallelism to distribute the computation over a CPU cluster via MPI.
The code is based on the original word2vec implementation from Google.
All source code files in the package are under Apache License 2.0.
The code is developed and tested on UNIX-based systems with the following software dependencies:
- Intel Compiler (the code is optimized for Intel CPUs)
- OpenMP (no separate installation is needed once the Intel compiler is installed)
- MKL (version 16.0.0 or higher is preferred, as it has improved significantly in recent years)
- MPI library with multi-threading support (Intel MPI, MPICH2, or MVAPICH2; needed for distributed word2vec only)
- HyperWords (for model accuracy evaluation)
- Numactl package (for multi-socket NUMA systems)
- Install the Intel C++ development environment (i.e., Intel compiler, OpenMP, MKL 16.0.0 or higher, and Intel MPI; free copies are available for some users)
- Enable the Intel C++ development environment:
source /opt/intel/compilers_and_libraries/linux/bin/compilervars.sh intel64 (adjust the path to match your installation)
source /opt/intel/impi/latest/compilers_and_libraries/linux/bin/compilervars.sh intel64 (adjust the path to match your installation)
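As an optional sanity check, you can verify afterwards that the tools are on your PATH (the exact version strings depend on your installation):
icc -V (compiler)
mpirun -V (MPI launcher; only needed for the distributed binary)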
- Install numactl package
sudo yum install numactl (on RedHat/CentOS)
sudo apt-get install numactl (on Ubuntu)
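numactl only matters on multi-socket NUMA machines; for example, it can be used to inspect the topology and to bind a run to a single socket (NUMA node 0 below is just an illustration):
numactl --hardware (lists the NUMA nodes of the machine)
numactl -N 0 -m 0 <command> (binds a command's CPUs and memory to NUMA node 0)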
- Download the code:
git clone https://github.com/IntelLabs/pWord2Vec
- Run ./install.sh to build the package (it downloads hyperwords and compiles the source code).
Note that the installation tries to produce two binaries: pWord2Vec and pWord2Vec_mpi. If you are only interested in the non-MPI version of word2vec, you do not need to set up MPI; the build of pWord2Vec_mpi will then fail, but you can still use the non-MPI binary for the single-machine demos below.
- Download the data:
cd data; ./getText8.sh or ./getBillion.sh
- Run the demo script:
cd sandbox; ./run_single_text8.sh (for the single-machine demo) or ./run_mpi_text8.sh (for the distributed word2vec demo)
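Each demo script wraps a direct invocation of the binary built above. As a rough, illustrative sketch only (the flags are assumed to follow the original word2vec conventions; the scripts contain the exact options and paths actually used):
./pWord2Vec -train ../data/text8 -output vectors.txt -size 100 -window 5 -negative 5 -min-count 5 -iter 5 -threads <ncores>
mpirun -np 2 ./pWord2Vec_mpi ... (same style of flags, launched across MPI ranks)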
- Run the code on the 1-billion-word-benchmark:
cd billion; ./run_single.sh (for single-machine word2vec) or ./run_mpi.sh (for distributed word2vec); set ncores in the script to the number of logical cores of your machine, as shown below
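The number of logical cores can be obtained with, for example:
nproc
lscpu | grep '^CPU(s):'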
- Evaluate the models:
cd sandbox; ./eval.sh or cd billion; ./eval.sh
- Parallelizing Word2Vec in Shared and Distributed Memory, arXiv, 2016.
- Parallelizing Word2Vec in Multi-Core and Many-Core Architectures, in NIPS workshop on Efficient Methods for Deep Neural Networks, Dec. 2016.
For questions and bug reports, you can reach me at https://cs.gsu.edu/~sji/