This is an interview with Gabriel Weinberg, founder of Duck Duck Go and all-around startup guru, on what DDG’s architecture looks like in 2012.
An app offering real-time translations is to allow people in Japan to speak to foreigners over the phone with both parties using their native tongue.
NTT Docomo - the country's biggest mobile network - will initially convert Japanese to English, Mandarin and Korean, with other languages to follow.
Even though the translations are bound to be hilariously bad sometimes, this may still be useful in some situations.
33bits.org/2012/02/20/is-writing-style-sufficient-to-deanonymize-material-posted-online/, posted 2012 by peter in language nlp privacy science
So what exactly did we achieve? Our research has dramatically increased the number of authors that can be distinguished using writing-style analysis: from about 300 to 100,000. More importantly, the accuracy of our algorithms drops off gently as the number of authors increases, so we can be confident that they will continue to perform well as we scale the problem even further. Our work is therefore the first time that stylometry has been shown to have serious implications for online anonymity.
Pattern is a web mining module for the Python programming language.
It bundles tools for data retrieval (Google + Twitter + Wikipedia API, web spider, HTML DOM parser), text analysis (rule-based shallow parser, WordNet interface, syntactical + semantical n-gram search algorithm, tf-idf + cosine similarity + LSA metrics) and data visualization (graph networks).
The module is bundled with 30+ example scripts.
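To make the "tf-idf + cosine similarity" metric mentioned above concrete, here is a minimal standard-library sketch. It does not use Pattern's own API; the function names `tf_idf_vectors` and `cosine_similarity`, the whitespace tokenizer, and the plain `tf * log(N / df)` weighting are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tf_idf_vectors(documents):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    # Naive tokenizer (an assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: the number of documents each term appears in.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


docs = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stock prices fell sharply today",
]
vectors = tf_idf_vectors(docs)
print(cosine_similarity(vectors[0], vectors[1]))  # the two cat sentences share terms
print(cosine_similarity(vectors[0], vectors[2]))  # no shared terms
```

Terms that occur in every document get a weight of zero under this weighting, which is why common words contribute little to the similarity score.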
This article shows you how to write a relatively simple script to extract text paragraphs from large chunks of HTML code, without knowing its structure or the tags used. It works on news articles and blog pages with worthwhile text content, among others…
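The excerpt does not say how the article's script works, so the following is only one plausible heuristic, not the article's method: keep lines whose ratio of visible text to raw markup is high. The function name `extract_text`, the thresholds `min_density` and `min_chars`, and the assumption that each block of content sits on its own line are all illustrative.

```python
import re
from html import unescape

# Matches any tag-like span; a rough approximation, not a full HTML parser.
TAG = re.compile(r"<[^>]+>")


def extract_text(html, min_density=0.5, min_chars=40):
    """Return lines that are mostly text rather than markup."""
    paragraphs = []
    for line in html.splitlines():
        raw = line.strip()
        if not raw:
            continue
        # Strip tags, decode entities such as &amp;, and trim whitespace.
        text = unescape(TAG.sub("", raw)).strip()
        # Density: share of the raw line that survives tag removal.
        density = len(text) / len(raw)
        if density >= min_density and len(text) >= min_chars:
            paragraphs.append(text)
    return paragraphs


page = """<div class="nav"><a href="/">Home</a> <a href="/about">About</a></div>
<p>Writing style alone can be enough to tell one author apart from many others.</p>
<div id="footer"><a href="/terms">Terms</a></div>"""

for paragraph in extract_text(page):
    print(paragraph)
```

Navigation bars and footers tend to be short and tag-heavy, so they fall below both thresholds, while article paragraphs pass. Pages that put all markup on a single line would need to be split into blocks first.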
Jellyfish is a Python library for doing approximate and phonetic matching of strings.
String comparison:
* Levenshtein Distance
* Damerau-Levenshtein Distance
* Jaro Distance
* Jaro-Winkler Distance
* Match Rating Approach Comparison
* Hamming Distance
Phonetic encoding:
* American Soundex
* Metaphone
* NYSIIS (New York State Identification and Intelligence System)
* Match Rating Codex
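To illustrate the two families of algorithms listed above, here is a small pure-Python sketch of one string comparison (Levenshtein distance) and one phonetic encoding (American Soundex). This is not Jellyfish's code or API; the function names `levenshtein` and `soundex` are chosen for this example only.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the row buffer as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


# Letter-to-digit table for American Soundex.
_SOUNDEX_CODES = {}
for _digit, _letters in (("1", "bfpv"), ("2", "cgjkqsxz"), ("3", "dt"),
                         ("4", "l"), ("5", "mn"), ("6", "r")):
    for _letter in _letters:
        _SOUNDEX_CODES[_letter] = _digit


def soundex(name):
    """Four-character American Soundex code, e.g. a letter plus three digits."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    last = _SOUNDEX_CODES.get(letters[0], "")
    for char in letters[1:]:
        if char in "hw":
            continue  # h and w do not separate letters with the same code
        code = _SOUNDEX_CODES.get(char, "")
        if code and code != last:
            result += code
        last = code  # vowels reset the run, so a repeated code is kept
        if len(result) == 4:
            break
    return result.ljust(4, "0")


print(levenshtein("kitten", "sitting"))      # 3
print(soundex("Robert"), soundex("Rupert"))  # R163 R163
```

Edit distance suits typo detection, while a phonetic code groups names that sound alike even when their spellings differ by several characters.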
Open source Python modules, linguistic data and documentation for research and development in natural language processing and text analytics, with distributions for Windows, Mac OSX and Linux.
VisualText is the premier integrated development environment for building information extraction systems, natural language processing systems, and text analyzers. The Professional version is now FREE for personal, internal, academic, development, and non-commercial use.
The main module, Lingua::EN::MatchNames, exports one function by default: name_eq(). You can feed it either four parameters, name_eq( $firstname0, $lastname0, $firstname1, $lastname1 ), or two, name_eq( $name0, $name1 ) (thanks to Lingua::EN::NameParse, which breaks full names into their constituent components). It will return a certainty score between 0 and 100, or undef if the names cannot be matched via any method known to the module.
Researchers at Germany's Karlsruhe Institute of Technology (KIT) have developed a method for mobile phones to convert silent mouth movements into speech. The technology is based on the principle of electromyography, that is, the acquisition and recording of electrical potentials generated by muscle activity. This muscle activity is measured in the face and converted into speech.
An example is soundless calling.
The user can speak into the phone soundlessly, but is still understood by the conversation partner on the other end of the line. As a result, it is possible to communicate in silent environments, at the cinema or theater, without disturbing others. Another field of use is the transmission of confidential information.