The accuracy of any security product is directly tied to the breadth of the corpus of data on which it is built. For Nexgate, this means that the success of our products depends on our ability to keep everything we have ever scanned, forever, in a way that is still readily accessible. In the days before NoSQL, this was hard. This is how Datastax and Cassandra make it easy.
2. A Little About Us
Company – Security & Compliance for Social
Launched April 2013 - Series A from Sierra & WindForce Ventures
– 15 employees, 7 in Engineering (2 Data Scientists)
Security guys from:
Customers:
3. Key Enterprise Pain Points
① Brand social account sprawl
• Can't inventory, audit, track social media
infrastructure
• Can't continuously find fake accounts
② Inbound protection for accounts
• Nothing to detect and remediate account
anomalies / hacks
• No automated coverage for volumes of
inappropriate and malicious content
③ Outbound compliance controls
• Too many admins and apps installed
across multiple accounts
• Little or no automated coverage for
sensitive and regulated data
Novartis Slapped
by the FDA
FINRA begins social
compliance audits
Spam
4. Where Nexgate Fits
Protecting the social account itself
Nexgate
Protect branded accounts and ensure compliance
Find, audit, and track the actual social accounts of the brand
Catch & remediate social account hacks, tampering, and misuse
Remove bad 'inbound' content including spam, malware, and acceptable use violations
Enforce usage of approved publishing platforms
Comply with regulations using prebuilt content policies, workflow, and intelligent archiving
Listening Platforms
Mine external social data and conversations
• Find brand 'mentions' and present them with inferences
• Provide volumes of market data that is analyzed for trends, share of voice, etc.
• Social CRM identification of key conversations and influencers that may need engagement
Publishing Platforms
Engage audiences and track outcomes
• Build communities
• Deliver content, custom apps, ads with workflow
• Promotions, contests, and campaigns
5. :001> Content classification is what we do. The completeness of any
classification system is predicated on the breadth of the corpus of data upon
which it is built.
8. :004> Social data is small and jagged.
• Average 1K all in, content and metadata
• Some common small stuff: time, social IDs, parent, account
• Some common big stuff: content, links
• Lots of disparate stuff, specific to the social platform
9. :005>
Keep in SQL: Fixed length, non-null, heavily indexed, group access
Keep in NoSQL: Variable length, commonly null, non-indexed, single access, text search
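The split above can be sketched as a small routing function. This is only an illustration: the field names (`item_id`, `posted_at`, `social_id`, `parent_id`, `account_id`) are hypothetical stand-ins for the "common small stuff" mentioned on the previous slide, and the slides do not show the actual schema.

```python
# Hypothetical fixed-length, non-null fields that would live in SQL.
FIXED_FIELDS = ("item_id", "posted_at", "social_id", "parent_id", "account_id")


def split_record(record):
    """Split one social item into a SQL row and a NoSQL document."""
    missing = [name for name in FIXED_FIELDS if record.get(name) is None]
    if missing:
        raise ValueError(f"fixed fields must be non-null: {missing}")

    # Small, always-present fields go to the relational side.
    sql_row = {name: record[name] for name in FIXED_FIELDS}

    # Everything else is variable length or platform specific;
    # null values are simply not stored.
    nosql_doc = {
        key: value
        for key, value in record.items()
        if key not in FIXED_FIELDS and value is not None
    }
    return sql_row, nosql_doc
```

Both halves would share the item ID so that a row in one store can be joined back to its document in the other.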
10. :006> Requirements
• Simple, proven horizontal scalability
• Integrated tools for research: search, analysis
• Operational simplicity; nodes all the same
• Enterprise support
11. :007> Deployment
• Multi-region AWS
• M1 Large instances
• Instance attached storage
• About to scale again
• Separate dev, test, prod clusters
Datastax:
• Start-up pricing, per-core pricing
• On site experts, responsive support
12. Over 250 million pieces of social
media content in total, spread across
Facebook, Twitter, YouTube,
Google+, LinkedIn
Currently about half a million new
pieces of content per day
– All classified in real time as it
comes in
About 50,000 new social media
content authors per day
Cassandra is a great choice for a
database: it allows flexibility for the
rapidly changing landscape of
social media threats
Scale of Data
14. Among the many security and compliance
classifications that Nexgate provides, we also
have powerful spam detection
Spam can be a single link directing to a
fraudulent site (screenshots of a Facebook
comment):
Fighting Spam with
Cassandra
15. Or it can be less obvious, and more personal. This is extremely common.
Here, the same user has posted the same message across different social
media accounts (screenshot taken from Nexgate product):
16. Social media spam grew by
355% in the first half of 2013.
Get the report at http://nx.gt/SocialSpamReport
17. Can create Spam signatures to catch this
type of content
...but it would be too slow to catch Spam in
real time.
Cassandra
Cassandra and
Social Media Spam
18. Even though Cassandra is a schemaless NoSQL
database, it is worth carefully defining
the data model
Can't just "throw data at it": that can make for
some really inefficient queries
Define the data model based on how you will
query the data
For us, we want to find spam content
that has been posted multiple times
– Spammers tend to post the same message repeatedly
Define Your Data Model
19. Typical table in Cassandra
– Wide "unconstrained" rows are a nice feature compared with SQL
Spam Multiplicity Data Model
Row key -> hash of content
Column Key -> Unique ID (strictly increasing with time)
Column Value -> Item_id and time of post
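A minimal in-memory sketch of this wide-row layout, assuming a plain Python dict stands in for the Cassandra table. The class name, method names, and the choice of SHA-1 are illustrative assumptions; the slides do not say which hash function or client library is used.

```python
import hashlib
from collections import defaultdict


class SpamMultiplicityTable:
    """In-memory stand-in for the wide-row table described above."""

    def __init__(self):
        # row key -> {column key -> column value}
        self.rows = defaultdict(dict)

    @staticmethod
    def row_key(content):
        # Row key: hash of the content (SHA-1 is an assumption).
        return hashlib.sha1(content.encode("utf-8")).hexdigest()

    def record_post(self, content, unique_id, item_id, posted_at):
        # Column key: unique ID; column value: (item_id, time of post).
        # Writing the same unique_id again overwrites the column.
        self.rows[self.row_key(content)][unique_id] = (item_id, posted_at)

    def multiplicity(self, content):
        # Number of columns in the row = number of same-content posts.
        return len(self.rows.get(self.row_key(content), {}))
```

For example, recording the same message under unique IDs 1, 2, and then 2 again leaves two columns in the row, so `multiplicity` reports 2 rather than 3.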
20. Spammers typically post the same content over and over
Easy to determine how many times a same-content post is made:
check the number of columns
Will never double count, because writing an existing column key simply
overwrites that column instead of adding a new one
Indexed by the content, so quick reads and writes
By reading the column value, can extract the time series information
of duplicated posts
– Can also map back to the original value – we store actual content
indexed by the item_id in another Cassandra table
Cassandra is not a magic bullet
– still need a relational database to glue all the pieces of data together
– Batch processing may need other tools like Hadoop
Why this Data Model?
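The time-series and map-back points can be sketched as follows, assuming one row has already been read into a dict of `{unique_id: (item_id, posted_at)}`. The function names and the `content_by_item_id` lookup (standing in for the second Cassandra table) are hypothetical.

```python
def post_times(row):
    """Return post times ordered by column key (increasing with time)."""
    return [posted_at for _, (_item_id, posted_at) in sorted(row.items())]


def gaps_between_posts(row):
    """Return the time elapsed between consecutive duplicate posts."""
    times = post_times(row)
    return [later - earlier for earlier, later in zip(times, times[1:])]


def original_content(row, content_by_item_id):
    """Look up the stored content of the first post in the row."""
    first_key = min(row)
    item_id, _posted_at = row[first_key]
    return content_by_item_id[item_id]
```

Short, regular gaps between duplicates are one signal that might point to automated posting, though the slides do not say how the time series is used.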
21.
22. This has become invaluable to us for catching spam content in real
time – the following “rant” comment was posted 38 times…
– Brand can more easily moderate given automated tools
Real-world spam multiplicity
In another example, a customer received 25,000 inappropriate
messages, and this tool helped us automate content removal
23. Another way to tackle real-time spam is by
identifying spammy users
– Since Cassandra effortlessly keeps all the
content we observed, our algorithm takes into
account all the posts contributed by an author
to determine if they are a spammer
Additionally, it is important to keep all data
to train our 100+ classifiers
Importance of Keeping All Data
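One way an author-level signal could be built from the full post history is sketched below. This is not Nexgate's algorithm, which the slides do not describe; the duplicate-ratio heuristic, the function names, and the `min_posts` and `threshold` defaults are all assumptions chosen for illustration.

```python
from collections import Counter


def duplicate_ratio(post_hashes):
    """Fraction of an author's posts whose content appears more than once."""
    if not post_hashes:
        return 0.0
    counts = Counter(post_hashes)
    repeated = sum(n for n in counts.values() if n > 1)
    return repeated / len(post_hashes)


def looks_spammy(post_hashes, min_posts=5, threshold=0.8):
    """Flag authors with enough history and mostly repeated content."""
    enough_history = len(post_hashes) >= min_posts
    return enough_history and duplicate_ratio(post_hashes) >= threshold
```

The input is the list of content hashes for every post by one author, which is only available because nothing is ever deleted.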
24. Cassandra actually has been humming along quite nicely!
– Barely any tweaking needed from default values
– No deletes (just the nature of our dataset) => not a lot of frequent
repairs performed (repair is done to resolve inconsistencies across
all replicas of data due to deletes)
• Fine for us, because repair requires intensive disk I/O
Only times we observed performance issues:
– When the rates of our reads and writes reached a certain threshold
– When the size of the data being inserted was too large
– Heap memory issue with Cassandra 1.1.x
In all cases, Datastax provided a quick and simple solution,
mostly just toggling a few parameters in config files and
restarting the nodes
Tuning Cassandra
25. Community is wonderful - it's really easy to jump on the
Cassandra IRC channel and talk to fellow users and
developers to get real-time feedback.
– With IRC and mailing list help, we implemented composite columns
to detect malware sites on our second day of using Cassandra,
3 years ago
In fact, when we tested a migration to the latest version of
Cassandra, and one of our Ruby wrappers didn't play nice with
CQL3, I was able to speak directly with the Ruby wrapper
author on IRC and got an explanation of why it didn't work.
– In the same day, I committed and made a pull request for a fix to
the Ruby wrapper on github, and the author looked at it the next
morning
Datastax support has been invaluable for providing fast
feedback and simple solutions
Cassandra Community
26. OpsCenter helpful in debugging
performance issues
Solr – used to obtain training data for
classifiers by phrase matching
Looking forward:
– Datastax Hadoop support, to look into training
on labeled data with MapReduce
Datastax Additional Tools
27. Thank you Datastax and RelateIQ!
Let us show you: nexgate.com/demo
Follow us:
@NXGate
facebook.com/NXGate
Editor's notes
Understanding and managing the touch points and scale of your social presence
• Low barrier to adoption => unmanaged account sprawl
• Focus on sentiment alone => miss activity, risks, and opportunities on what your company is responsible for
• Manual moderation/measurement => rapidly rising costs, reduced effectiveness, risks of PR crises
Establishing governance policies and processes to protect your brand
• Social accounts & applications live outside the corporate network => corporate governance and security risks
• Siloed account owners => no auditing of account access
• Manual moderation for content on accounts => higher probability for errors and crises