Threat intelligence research and learning
Tang S, Mi X, Li Y, et al. Clues in Tweets: Twitter-Guided Discovery and Analysis of SMS Spam. ACM Conference on Computer and Communications Security (CCS). 2022.
Main research question or hypothesis:
With its critical role in business and service delivery through mobile devices, SMS (Short Message Service) has long been abused for spamming, which is still on the rise today possibly due to the emergence of A2P bulk messaging. The effort to control SMS spam has been hampered by the lack of up-to-date information about illicit activities.
……
SMS (Short Message Service) is abused by cybercriminals for phishing, advertising, and other spam, and existing reporting systems cannot effectively curb spam activity. The paper analyzes and studies the SMS spam reported in tweets and proposes improvements.
Research methods
To identify the reported SMS spam and extract their content, we developed a new technique to automatically identify the tweets reporting SMS spam and accurately recover from its image attachments spam messages. Our approach, called SpamHunter, runs a pipeline that first uses a set of keywords to collect tweets through Tweet APIs, then filters these tweets with image object detection to identify those including SMS screenshots (particularly, an SMS dialog box or text cell), and finally classifies the tweets with SMS screenshots as spam-reporting or not, using a natural language processing (NLP) and machine learning (ML) model. These confirmed spam-reporting tweets are further inspected to extract message content from the attached SMS screenshots, by intersecting the SMS text cell with the text paragraphs captured using a Google Vision API.
……
Designed and implemented SpamHunter to collect an SMS spam dataset and extract the message content from the screenshots it contains.
Continuously monitored newly emerging SMS spam and conducted an in-depth measurement study of spammers' sending strategies and trends.
Used the collected data to evaluate the SMS ecosystem (anti-spam services, bulk SMS services, and SMS apps) and offered suggestions for improvement.
Key research findings, conclusions, and innovations
Found a channel for continuously collecting high-quality, diverse, and public SMS spam data -> tweets.
Over four years of collection, SpamHunter has produced the largest publicly released SMS spam dataset to date.
The URLs carried in the spam show that the associated phishing or malicious sites use a variety of hosting methods, including bulletproof hosting services (e.g., Shinjiru), port-forwarding services (e.g., ngrok.io), dynamic DNS services (e.g., duckdns.org), and anycast IPs, indicating that these cybercrime operations are carefully planned and well funded.
Analysis of the collected spam dataset yields further conclusions ……
Innovations:
The measurement of Twitter as a spam-reporting platform reveals the behavior of spam-reporting users and the responses to their actions, which had never been studied before.
Examined whether today's SMS services and messaging apps provide sufficient protection. -> Current SMS spam detection capability is still lacking.
…… (omitted)
Main-tech
The paper proposes a system named SpamHunter:
From the pipeline diagram in the paper, the system consists of the following components:
Tweet Collector: obtains the raw target data (Raw Tweets). Calls the Twitter Academic API with a set of keywords and restricts results to tweets containing at least one image.
SMS Image Detector: from the Raw Tweets, further identifies tweets containing SMS screenshots (SMS Images). An efficient and accurate object-detection algorithm, YOLOv3 (open source: https://github.com/developer0hye/Yolo_Label), is used to build the SMS image detector.
Spam-Reporting Tweet Classifier: removes irrelevant tweets to extract the valid experimental data (Spam-Reporting Tweets). (Tweets reporting spam tend to carry negative sentiment, such as annoyance at the spam.) -> A sentiment model (a 3-layer neural network) is trained with machine learning to judge the user's sentiment in a tweet and thereby select the target data.
SMS Text Recognizer: combines the data produced by the SMS Image Detector and the Spam-Reporting Tweet Classifier and extracts the text content from the screenshots.
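The four components above can be sketched as a minimal pipeline. This is only an illustration of how the stages fit together; all function names and bodies are placeholders I made up, not the paper's actual code:

```python
# Minimal sketch of the SpamHunter pipeline described above.
# Every function body is a stub; the real system calls external services.

def collect_tweets(keywords):
    """Tweet Collector: fetch tweets matching keywords that carry >= 1 image."""
    return []  # would call the Twitter Academic API here

def has_sms_screenshot(tweet):
    """SMS Image Detector: YOLOv3-style detection of an SMS dialog/text cell."""
    return True  # placeholder decision

def is_spam_reporting(tweet):
    """Spam-Reporting Tweet Classifier: sentiment model over the tweet text."""
    return True  # placeholder decision

def extract_sms_text(tweet):
    """SMS Text Recognizer: OCR paragraphs intersected with the text cell."""
    return ""  # would call the Google Vision API here

def spamhunter(keywords):
    """Run the four stages in order and return the recovered spam messages."""
    spam_messages = []
    for tweet in collect_tweets(keywords):
        if has_sms_screenshot(tweet) and is_spam_reporting(tweet):
            spam_messages.append(extract_sms_text(tweet))
    return spam_messages
```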
Key steps
Collecting the raw tweet data (Raw Tweets): retrieve raw tweets through the Tweet API using keyword queries.
runs a pipeline that first uses a set of keywords to collect tweets through Tweet APIs, then filters these tweets with image object detection to identify those including SMS screenshots (particularly, an SMS dialog box or text cell)……
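Such a keyword search can be sketched against the Twitter API v2 full-archive search endpoint. The endpoint and the `has:images` / `-is:retweet` operators are standard v2 query features, but the keyword list and parameter choices here are my assumptions, not the paper's:

```python
from urllib.parse import urlencode

# Hypothetical query construction for the Twitter API v2 full-archive search.
SEARCH_URL = "https://api.twitter.com/2/tweets/search/all"

def build_search_request(keywords):
    # OR the spam-related keywords together, require at least one image,
    # and exclude retweets to reduce duplicate reports.
    query = "(" + " OR ".join(keywords) + ") has:images -is:retweet"
    params = {"query": query, "expansions": "attachments.media_keys"}
    return SEARCH_URL + "?" + urlencode(params)

# Example with made-up keywords (the paper's keyword set is not reproduced here):
url = build_search_request(["smsspam", "spamtext"])
```

The actual request would also need an authorized bearer token; only the query construction is shown.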
Apply the object-detection algorithm YOLOv3 (https://github.com/developer0hye/Yolo_Label), with IoU (Intersection over Union) as the localization metric, and then classify the tweets carrying SMS screenshots as spam-reporting or not.
classifies the tweets with SMS screenshots as spam-reporting……
To measure the localization accuracy of bounding boxes, we used Interaction over Union (IoU) as the metric
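IoU (Intersection over Union; the quote's "Interaction" is the paper's spelling) for two axis-aligned bounding boxes can be computed, for example, as:

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap rectangle (clamped to zero when the boxes don't intersect).
    iw = max(0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```

A detection is typically counted as correct when its IoU with the ground-truth box exceeds a chosen threshold (e.g., 0.5 in common object-detection practice).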
Train a sentiment model using natural language processing (NLP) and machine learning (ML), and identify the target data from the sentiment expressed in the tweet text.
using a natural language processing (NLP) and machine learning (ML) model.
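The paper only names the classifier as a 3-layer neural network without specifying it; a toy forward pass under assumed layer shapes (the weights, sizes, and text featurization are illustrative only, not the paper's model):

```python
import math

def dense(inputs, weights, biases):
    """One fully connected layer: out_j = sum_i inputs[i] * weights[i][j] + biases[j]."""
    return [sum(x * w for x, w in zip(inputs, col)) + b
            for col, b in zip(zip(*weights), biases)]

def relu(vec):
    return [max(0.0, x) for x in vec]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sentiment_score(features, params):
    """Toy 3-layer net: two hidden ReLU layers and a sigmoid output in [0, 1].
    A score near 1 would mean negative sentiment -> likely a spam report."""
    h1 = relu(dense(features, params["w1"], params["b1"]))
    h2 = relu(dense(h1, params["w2"], params["b2"]))
    (logit,) = dense(h2, params["w3"], params["b3"])
    return sigmoid(logit)
```

In practice the `features` vector would come from an NLP featurizer (e.g., word embeddings of the tweet text) and the weights from training on labeled spam-reporting tweets.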
Identify the message content in the screenshot by intersecting the SMS text cell with the text paragraphs captured via the Google Vision API, applying a threshold to decide whether a paragraph belongs to the message:
These confirmed spam-reporting tweets are further inspected to extract message content from the attached SMS screenshots, by intersecting the SMS text cell with the text paragraphs captured using a Google Vision API.
For this purpose, we applied the Text Documentation Detection API [7] provided by Google Vision to each SMS image, which recovers text paragraphs along with their coordinates on the image. Then, the coordinates of each paragraph, as bounded by a “paragraph box” on the image, are compared with those of an identified text cell; it is considered to be part of an SMS message if most of the paragraph is within the text cell.
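The "most of the paragraph is within the text cell" test can be written as an overlap-ratio check. The 0.8 threshold below is my assumption; the paper's exact value is not given in this excerpt:

```python
def overlap_fraction(paragraph, cell):
    """Fraction of the paragraph box's area that lies inside the text cell.
    Boxes are (x1, y1, x2, y2) in image coordinates."""
    px1, py1, px2, py2 = paragraph
    cx1, cy1, cx2, cy2 = cell
    iw = max(0, min(px2, cx2) - max(px1, cx1))
    ih = max(0, min(py2, cy2) - max(py1, cy1))
    area = (px2 - px1) * (py2 - py1)
    return (iw * ih) / area if area else 0.0

def in_sms_message(paragraph, cell, threshold=0.8):
    # Assumed threshold; the paper only says "most of the paragraph".
    return overlap_fraction(paragraph, cell) >= threshold
```

Note the asymmetry versus IoU: here the intersection is divided by the paragraph's own area, since the question is how much of the paragraph falls inside the cell, not how well the two boxes coincide.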
The key difficulties are how to train an effective and accurate sentiment model to raise the precision of data filtering, and the algorithm used to extract the text from the screenshots, including how its threshold is set to reduce text-recognition error, rather than relying on off-the-shelf OCR alone.
Unfortunately, the paper does not elaborate much on the training process, so that part has to be worked out on one's own.
After that come the various data analyses …… to be supplemented later.