If we count how often each word appears in a text sample from (most) human languages, the counts follow Zipf's distribution: a word's frequency is roughly inversely proportional to its rank in the frequency table. This can help to distinguish human languages (and humans) from randomly generated text (e.g. from spambots). The plot below shows this:
Below are the tools I made to verify this. The first is a C++ program that parses a text and prints the distribution of the words it encountered. The second is an R script that draws a plot from that distribution.
C++ parser:
#include <algorithm>
#include <cctype>
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <vector>
#include <boost/foreach.hpp>
#include <boost/tokenizer.hpp>

using namespace std;
using namespace boost;

// tolower() is only defined for values representable as unsigned char
static char toLowerChar(char c)
{
    return static_cast<char>(tolower(static_cast<unsigned char>(c)));
}

int main(int argc, char* argv[])
{
    if (2 != argc) {
        cerr << "usage: " << argv[0] << " filename" << endl;
        return EXIT_FAILURE;
    }

    // read whole file into string
    ifstream input(argv[1]);
    if (!input) {
        cerr << "cannot open " << argv[1] << endl;
        return EXIT_FAILURE;
    }
    stringstream fileBuffer;
    fileBuffer << input.rdbuf();
    string text = fileBuffer.str();

    // make content of file lower case
    transform(text.begin(), text.end(), text.begin(), toLowerChar);

    // create map, where key = word, value = number of occurrences of this word in text
    char_separator<char> sep(" \t\n-;.,");
    tokenizer< char_separator<char> > tokens(text, sep);
    map<string, unsigned> words;
    typedef map<string, unsigned>::value_type wordPairType;
    BOOST_FOREACH (string word, tokens) {
        ++words[word]; // a missing key starts at 0
    }

    // create vector with the counts of all words in text
    vector<unsigned> distribution;
    BOOST_FOREACH (const wordPairType& entry, words) {
        distribution.push_back(entry.second);
    }

    // counts need to be sorted in descending order
    sort(distribution.rbegin(), distribution.rend());

    // show results, one count per line
    BOOST_FOREACH (unsigned count, distribution) {
        cout << count << endl;
    }

    return EXIT_SUCCESS;
}
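R script:
# read the sorted word counts from the file given as first argument
args <- commandArgs(TRUE)
sizes <- scan(args[1])
# plot count against rank, with both axes on a logarithmic scale
png(filename = "results.png", height = 500, width = 700, bg = "white")
plot(sizes, xlab = "words", ylab = "occurrences", type = "l", log = "xy")
# close the device so that results.png is written out
dev.off()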
Assuming that the program was compiled to ./a.out, the R script was saved as chart.r, and the sample text is named test4.txt, running the command below should save the plot to results.png.
bash-3.2$ ./a.out test4.txt > results.txt && Rscript chart.r results.txt
Read 314 items
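Besides looking at the plot, the counts in results.txt can be checked numerically. Below is a minimal sketch, not part of the original tools: a hypothetical helper, zipfSlope, that fits a straight line to log(count) against log(rank) by ordinary least squares. It assumes the counts are sorted in descending order and are all greater than zero, which is what the parser prints. For text that follows Zipf's distribution the slope should come out somewhere near -1, but this is only a rough check, because a least-squares fit on log-log data is a crude estimator.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of the original tools).
// Returns the least-squares slope of log(count) against log(rank).
// Assumes counts are sorted in descending order and all greater than zero.
// Returns 0 when there are fewer than two points to fit.
double zipfSlope(const std::vector<unsigned>& counts)
{
    const std::size_t n = counts.size();
    if (n < 2) {
        return 0.0;
    }

    double sumX = 0.0, sumY = 0.0, sumXX = 0.0, sumXY = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        const double x = std::log(static_cast<double>(i + 1)); // ranks start at 1
        const double y = std::log(static_cast<double>(counts[i]));
        sumX += x;
        sumY += y;
        sumXX += x * x;
        sumXY += x * y;
    }

    // the ranks are distinct, so the denominator is non-zero for n >= 2
    const double denominator = n * sumXX - sumX * sumX;
    return (n * sumXY - sumX * sumY) / denominator;
}
```

To use it, read the numbers from results.txt into a std::vector<unsigned> (for example with operator>> on an ifstream) and print the value returned by zipfSlope.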